Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

Started by Peter Geoghegan over 2 years ago · 78 messages · pgsql-hackers

I've been working on a variety of improvements to nbtree's native
ScalarArrayOpExpr execution. This builds on Tom's work in commit
9e8da0f7.

Attached patch is still at the prototype stage. I'm posting it as v1 a
little earlier than I usually would because there has been much back
and forth about it on a couple of other threads involving Tomas Vondra
and Jeff Davis -- seems like it would be easier to discuss with
working code available.

The patch adds two closely related enhancements to ScalarArrayOp
execution by nbtree:

1. Execution of quals with ScalarArrayOpExpr clauses during nbtree
index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now
"advance the scan's array keys locally", which sometimes avoids
significant amounts of unneeded pinning/locking of the same set of
index pages.

SAOP index scans become capable of eliding primitive index scans for
the next set of array keys in line in cases where it isn't truly
necessary to descend the B-Tree again. Index scans are now capable of
"sticking with the existing leaf page for now" when it is determined
that the end of the current set of array keys is physically close to
the start of the next set of array keys (the next set in line to be
materialized by the _bt_advance_array_keys state machine). This is
often possible.

Naturally, we still prefer to advance the array keys in the
traditional way ("globally") much of the time. That means we'll
perform another _bt_first/_bt_search descent of the index, starting a
new primitive index scan. Whether we try to skip pages on the leaf
level or stick with the current primitive index scan (by advancing
array keys locally) is likely to vary a great deal. Even during the
same index scan. Everything is decided dynamically, which is the only
approach that really makes sense.
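The dynamic choice described above can be sketched in miniature (a toy model, not the actual nbtree code; `advance_array_keys_locally` and the single-key comparison are hypothetical simplifications of what the real state machine does with full scan keys):

```python
# Hypothetical sketch of the decision between advancing the array keys
# "locally" (stay on the current leaf page) and "globally" (start a new
# primitive index scan with another _bt_first/_bt_search descent).

def advance_array_keys_locally(page_high_key, next_array_key):
    """Return True when the next set of array keys still falls within
    the current leaf page's key space, so no new descent of the index
    is needed to continue the scan.

    page_high_key  -- upper bound of the current leaf page's key space
    next_array_key -- the next array constant in line to be materialized
    """
    return next_array_key <= page_high_key

# Example: scanning an index with a IN (1, 2, 3).  If the current leaf
# page's high key covers values up to 5, all three would-be primitive
# index scans collapse into a single pass over that one page.
assert advance_array_keys_locally(page_high_key=5, next_array_key=2) is True
assert advance_array_keys_locally(page_high_key=5, next_array_key=9) is False
```

The real code compares full insertion scan keys against the page high key, but the shape of the decision is the same: cheap comparison first, index descent only when the comparison says the next keys are far away.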

This optimization can significantly lower the number of buffers pinned
and locked in cases with significant locality, and/or with many array
keys with no matches. The savings (when measured in buffers
pinned/locked) can be as high as 10x, 100x, or even more. Benchmarking
has shown that transaction throughput for variants of "pgbench -S"
designed to stress the implementation (hundreds of array constants)
under concurrent load can have up to 5.5x higher transaction
throughput with the patch. Less extreme cases (10 array constants,
spaced apart) see about a 20% improvement in throughput. There are
similar improvements to latency for the patch, in each case.

2. The optimizer now produces index paths with multiple SAOP clauses
(or other clauses we can safely treat as "equality constraints") on
each of the leading columns from a composite index -- all while
preserving index ordering/useful pathkeys in most cases.

The nbtree work from item 1 is useful even with the simplest IN() list
query involving a scan of a single column index. Obviously, it's very
inefficient for the nbtree code to use 100 primitive index scans when
1 is sufficient. But that's not really why I'm pursuing this project.
My real goal is to implement (or to enable the implementation of) a
whole family of useful techniques for multi-column indexes. I call
these "MDAM techniques", after the 1995 paper "Efficient Search of
Multidimensional B-Trees" [1] -- MDAM is short for "multidimensional
access method". In the context of the paper, "dimension" refers to
dimensions in a decision support system.

The most compelling cases for the patch all involve multiple index
columns with multiple SAOP clauses (especially where each column
represents a separate "dimension", in the DSS sense). It's important
that index sort order be preserved whenever possible, too. Sometimes this is
directly useful (e.g., because the query has an ORDER BY), but it's
always indirectly needed, on the nbtree side (when the optimizations
are applicable at all). The new nbtree code now has special
requirements surrounding SAOP search type scan keys with composite
indexes. These requirements make changes in the optimizer all but
essential.

Index order
===========

As I said, there are cases where preserving index order is immediately
and obviously useful, in and of itself. Let's start there.

Here's a test case that you can run against the regression test database:

pg@regression:5432 =# create index order_by_saop on tenk1(two,four,twenty);
CREATE INDEX

pg@regression:5432 =# EXPLAIN (ANALYZE, BUFFERS)
select ctid, thousand from tenk1
where two in (0,1) and four in (1,2) and twenty in (1,2)
order by two, four, twenty limit 20;

With the patch, this query gets 13 buffer hits. On the master branch,
it gets 1377 buffer hits -- which exceeds the number you'll get from a
sequential scan by about 4x. No coaxing was required to get the
planner to produce this plan on the master branch. Almost all of the
savings shown here are related to heap page buffer hits -- the nbtree
changes don't directly help in this particular example (strictly
speaking, you only need the optimizer changes to get this result).

Obviously, the immediate reason why the patch wins by so much is
because it produces a plan that allows the LIMIT to terminate the scan
far sooner. Benoit Tigeot (CC'd) happened to run into this issue
organically -- that was also due to heap hits, a LIMIT, and so on. As
luck would have it, I stumbled upon his problem report (in the
Postgres slack channel) while I was working on this patch. He produced
a fairly complete test case, which was helpful [3]. This example is
more or less just a distillation of his test case, designed to be easy
for a Postgres hacker to try out for themselves.

There are also variants of this query where a LIMIT isn't the crucial
factor, and where index page hits are the problem. This query uses an
index-only scan, both on master and with the patch (same index as
before):

select count(*), two, four, twenty
from tenk1
where two in (0, 1) and four in (1, 2, 3, 4) and
twenty in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,14,15)
group by two, four, twenty
order by two, four, twenty;

The patch gets 18 buffer hits for this query. That outcome makes
intuitive sense, since this query is highly unselective -- it's
approaching the selectivity of the query "select count(*) from tenk1".
The simple count(*) query gets 19 buffer hits for its own index-only
scan, confirming that the patch managed to skip only one or two leaf
pages in the complicated "group by" variant of the count(*) query.
Overall, the GroupAggregate plan used by the patch is slower than the
simple count(*) case (despite touching fewer pages). But both plans
have *approximately* the same execution cost, which makes sense, since
they both have very similar selectivities.

The master branch gets 245 buffer hits for the same group by query.
This is almost as many hits as a sequential scan would require -- even
though there are precisely zero heap accesses needed by the underlying
index-only scan. As with the first example, no planner coaxing was
required to get this outcome on master. It is inherently very
difficult to predict how selective a query like this will be using
conventional statistics. But that's not actually the problem in this
example -- the planner gets that part right, on this occasion. The
real problem is that there is a multiplicative factor to worry about
on master, when executing multiple SAOPs. That makes it almost
impossible to predict the number of pages we'll pin. With the patch,
by contrast, scans with multiple SAOPs are often fairly similar to
scans that happen to have just one, on the leading column.

With the patch, it is simply impossible for an SAOP index scan to
visit any single leaf page more than once. Just like a conventional
index scan. Whereas right now, on master, using more than one SAOP
clause for a multi column index seems to me to be a wildly risky
proposition. You can easily have cases that work just fine on master,
while only slight variations of the same query see costs explode
(especially likely with a LIMIT). ISTM that there is significant value
in having a pretty accurate idea of the worst case in the planner.

Giving nbtree the ability to skip or not skip dynamically, based on
actual conditions in the index (not on statistics), seems like it has
a lot of potential as a way of improving performance *stability*.
Personally I'm most interested in this aspect of the project.

Note: we can visit internal pages more than once, but that seems to
make a negligible difference to the overall cost profile of scans. Our
policy is to not charge an I/O cost for those pages. Plus, the number
of internal page accesses is dramatically reduced (it's just not
guaranteed that there won't be any repeat accesses for internal pages,
is all).

Note also: there are hard-to-pin-down interactions between the
immediate problem on the nbtree side, and the use of filter quals
rather than true index quals, where the use of index quals is possible
in principle. Some problematic cases see excessive amounts of heap
page hits only (as with my first example query). Other problematic
cases see excessive amounts of index page hits, with little to no
impact on heap page hits at all (as with my second example query).
Some combination of the two is also possible.

Safety
======

As mentioned already, the ability to "advance the current set of array
keys locally" during a scan (the nbtree work in item 1) actually
relies on the optimizer work in item 2 -- it's not just a question of
unlocking the potential of the nbtree work. Now I'll discuss those
aspects in a bit more detail.

Without the optimizer work, nbtree will produce wrong answers to
queries, in a way that resembles the complaint addressed by historical
bugfix commit 807a40c5. This incorrect behavior (if the optimizer were
to permit it) would only be seen when there are multiple
arrays/columns, and an inequality on a leading column -- just like
with that historical bug. (It works both ways, though -- the nbtree
changes also make the optimizer changes safe by limiting the worst
case, which would otherwise be too much of a risk to countenance. You
can't separate one from the other.)

The primary change on the optimizer side is the addition of logic to
differentiate between the following two cases when building an index
path in indxpath.c:

* Unsafe: Cases where it's fundamentally unsafe to treat
multi-column-with-SAOP-clause index paths as returning tuples in a
useful sort order.

For example, the test case committed as part of that bugfix involves
an inequality, so it continues to be treated as unsafe.

* Safe: Cases where (at least in theory) bugfix commit 807a40c5 went
further than it really had to.

Those cases get to use the optimization, and usually get to have
useful path keys.

My optimizer changes are very kludgey. I came up with various ad-hoc
rules to distinguish between the safe and unsafe cases, without ever
really placing those changes into some kind of larger framework. That
was enough to validate the general approach in nbtree, but it
certainly has problems -- glaring problems. The biggest problem of all
may be my whole "safe vs unsafe" framing itself. I know that many of
the ostensibly unsafe cases are in fact safe (with the right
infrastructure in place), because the MDAM paper says just that. The
optimizer can't support inequalities right now, but the paper
describes how to support "NOT IN( )" lists -- clearly an inequality!
The current ad-hoc rules are at best incomplete, and at worst are
addressing the problem in fundamentally the wrong way.

CNF -> DNF conversion
=====================

Like many great papers, the MDAM paper takes one core idea, and finds
ways to leverage it to the hilt. Here the core idea is to take
predicates in conjunctive normal form (an "AND of ORs"), and convert
them into disjunctive normal form (an "OR of ANDs"). DNF quals are
logically equivalent to CNF quals, but ideally suited to SAOP-array
style processing by an ordered B-Tree index scan -- they reduce
everything to a series of non-overlapping primitive index scans, that
can be processed in keyspace order. We already do this today in the
case of SAOPs, in effect. The nbtree "next array keys" state machine
already materializes values that can be seen as MDAM style DNF single
value predicates. The state machine works by outputting the cartesian
product of each array as a multi-column index is scanned, but that
could be taken a lot further in the future. We can use essentially the
same kind of state machine to do everything described in the paper --
ultimately, it just needs to output a list of disjuncts, like the DNF
clauses that the paper shows in "Table 3".
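The cartesian-product behavior described above can be sketched in a few lines (a toy model; `dnf_single_value_predicates` is a hypothetical name, not a Postgres function):

```python
# Toy model of the "next array keys" state machine: materialize
# MDAM-style DNF single value predicates as the cartesian product of
# each column's sorted, deduplicated array of constants.
from itertools import product

def dnf_single_value_predicates(*arrays):
    """Yield each set of array keys in index keyspace order."""
    normalized = [sorted(set(arr)) for arr in arrays]
    # itertools.product is lazy, so the full cartesian product never
    # has to be materialized up front.
    yield from product(*normalized)

# WHERE two IN (0, 1) AND four IN (2, 1) becomes four disjuncts,
# emitted in keyspace order:
keys = list(dnf_single_value_predicates([0, 1], [2, 1]))
assert keys == [(0, 1), (0, 2), (1, 1), (1, 2)]
```

Each emitted tuple corresponds to one MDAM-style single value predicate; producing them in keyspace order is what lets the scan process them as a series of non-overlapping primitive index scans.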

In theory, anything can be supported via a sufficiently complete CNF
-> DNF conversion framework. There will likely always be the potential
for unsafe/unsupported clauses and/or types in an extensible system
like Postgres, though. So we will probably need to retain some notion
of safety. It seems like more of a job for nbtree preprocessing (or
some suitably index-AM-agnostic version of the same idea) than the
optimizer, in any case. But that's not entirely true, either (that
would be far too easy).

The optimizer still needs to optimize. It can't very well do that
without having some kind of advanced notice of what is and is not
supported by the index AM. And, the index AM cannot just unilaterally
decide that index quals actually should be treated as filter/qpquals,
after all -- it doesn't get a veto. So there is a mutual dependency
that needs to be resolved. I suspect that there needs to be a two way
conversation between the optimizer and nbtree code to break the
dependency -- a callback that does some of the preprocessing work
during planning. Tom said something along the same lines in passing,
when discussing the MDAM paper last year [2]. Much work remains here.

Skip Scan
=========

MDAM encompasses something that people tend to call "skip scan" --
terminology with a great deal of baggage. These days I prefer to call
it "filling in missing key predicates", per the paper. That's much
more descriptive, and makes it less likely that people will conflate
the techniques with InnoDB style "loose index scans" -- the latter is
a much more specialized/targeted optimization. (I now believe that
these are very different things, though I was thrown off by the
superficial similarities for a long time. It's pretty confusing.)

I see this work as a key enabler of "filling in missing key
predicates". MDAM describes how to implement this technique by
applying the same principles that it applies everywhere else: it
proposes a scheme that converts predicates from CNF to DNF. With just
a little extra logic required to do index probes to feed the
DNF-generating state machine, on demand.

More concretely, in Postgres terms: skip scan can be implemented by
inventing a new placeholder clause that can be composed alongside
ScalarArrayOpExprs, in the same way that multiple ScalarArrayOpExprs
can be composed together in the patch already. I'm thinking of a type
of clause that makes the nbtree code materialize a set of "array keys"
for a SK_SEARCHARRAY scan key dynamically, via ad-hoc index probes
(perhaps static approaches would be better for types like boolean,
which the paper contemplates). It should be possible to teach the
_bt_advance_array_keys state machine to generate those values in
approximately the same fashion as it already does for
ScalarArrayOpExprs -- and, it shouldn't be too hard to do it in a
localized fashion, allowing everything else to continue to work in the
same way without any special concern. This separation of concerns is a
nice consequence of the way that the MDAM design really leverages
preprocessing/DNF for everything.

Both types of clauses can be treated as part of a general class of
ScalarArrayOpExpr-like clauses. Making the rules around
"composability" simple will be important.

Although skip scan gets a lot of attention, it's not necessarily the
most compelling MDAM technique. It's also not especially challenging
to implement on top of everything else. It really isn't that special.
Right now I'm focussed on the big picture, in any case. I want to
emphasize the very general nature of these techniques. Although I'm
focussed on SAOPs in the short term, many queries that don't make use
of SAOPs should ultimately see similar benefits. For example, the
paper also describes transformations that apply to BETWEEN/range
predicates. We might end up needing a third type of expression for
those. They're all just DNF single value predicates, under the hood.

Thoughts?

[1]: http://vldb.org/conf/1995/P710.PDF
[2]: /messages/by-id/2587523.1647982549@sss.pgh.pa.us
[3]: https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d
--
Peter Geoghegan

Attachments:

v1-0001-Enhance-nbtree-ScalarArrayOp-execution.patch (application/octet-stream, +919 -85)
#2 Matthias van de Meent <boekewurm+postgres@gmail.com>
In reply to: Peter Geoghegan (#1)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Tue, 25 Jul 2023 at 03:34, Peter Geoghegan <pg@bowt.ie> wrote:

I've been working on a variety of improvements to nbtree's native
ScalarArrayOpExpr execution. This builds on Tom's work in commit
9e8da0f7.

Cool.

Attached patch is still at the prototype stage. I'm posting it as v1 a
little earlier than I usually would because there has been much back
and forth about it on a couple of other threads involving Tomas Vondra
and Jeff Davis -- seems like it would be easier to discuss with
working code available.

The patch adds two closely related enhancements to ScalarArrayOp
execution by nbtree:

1. Execution of quals with ScalarArrayOpExpr clauses during nbtree
index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now
"advance the scan's array keys locally", which sometimes avoids
significant amounts of unneeded pinning/locking of the same set of
index pages.

SAOP index scans become capable of eliding primitive index scans for
the next set of array keys in line in cases where it isn't truly
necessary to descend the B-Tree again. Index scans are now capable of
"sticking with the existing leaf page for now" when it is determined
that the end of the current set of array keys is physically close to
the start of the next set of array keys (the next set in line to be
materialized by the _bt_advance_array_keys state machine). This is
often possible.

Naturally, we still prefer to advance the array keys in the
traditional way ("globally") much of the time. That means we'll
perform another _bt_first/_bt_search descent of the index, starting a
new primitive index scan. Whether we try to skip pages on the leaf
level or stick with the current primitive index scan (by advancing
array keys locally) is likely to vary a great deal. Even during the
same index scan. Everything is decided dynamically, which is the only
approach that really makes sense.

This optimization can significantly lower the number of buffers pinned
and locked in cases with significant locality, and/or with many array
keys with no matches. The savings (when measured in buffers
pinned/locked) can be as high as 10x, 100x, or even more. Benchmarking
has shown that transaction throughput for variants of "pgbench -S"
designed to stress the implementation (hundreds of array constants)
under concurrent load can have up to 5.5x higher transaction
throughput with the patch. Less extreme cases (10 array constants,
spaced apart) see about a 20% improvement in throughput. There are
similar improvements to latency for the patch, in each case.

Considering that it caches/reuses the page across SAOP operations, can
(or does) this also improve performance for index scans on the outer
side of a join if the order of join columns matches the order of the
index?
That is, I believe this caches (leaf) pages across scan keys, but can
(or does) it also reuse these already-cached leaf pages across
restarts of the index scan/across multiple index lookups in the same
plan node, so that retrieval of nearby index values does not need to
do an index traversal?

[...]
Skip Scan
=========

MDAM encompasses something that people tend to call "skip scan" --
terminology with a great deal of baggage. These days I prefer to call
it "filling in missing key predicates", per the paper. That's much
more descriptive, and makes it less likely that people will conflate
the techniques with InnoDB style "loose Index scans" -- the latter is
a much more specialized/targeted optimization. (I now believe that
these are very different things, though I was thrown off by the
superficial similarities for a long time. It's pretty confusing.)

I'm not sure I understand. MDAM seems to work on an index level to
return full ranges of values, while "skip scan" seems to try to allow
systems to signal to the index to skip to some other index condition
based on arbitrary cutoffs. This would usually be those of which the
information is not stored in the index, such as "SELECT user_id FROM
orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go
though the user_id index and skip to the next user_id value when it
gets enough rows of a matching result (where "enough" is determined
above the index AM's plan node, or otherwise is impossible to
determine with only the scan key info in the index AM). I'm not sure
how this could work without specifically adding skip scan-related
index AM functionality, and I don't see how it fits in with this
MDAM/SAOP system.

[...]

Thoughts?

MDAM seems to require exponential storage for "scan key operations"
for conditions on N columns (to be precise, the product of the number
of distinct conditions on each column); e.g. an index on mytable
(a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
AND h IN (1, 2)" would require 2^8 entries. If 4 conditions were used
for each column, that'd be 4^8, etc...
With an index column limit of 32, that's quite a lot of memory
potentially needed to execute the statement.
So, this begs the question: does this patch have the same issue? Does
it fail with OOM, does it gracefully fall back to the old behaviour
when the clauses are too complex to linearize/compose/fold into the
btree ordering clauses, or are scan keys dynamically constructed using
just-in-time- or generator patterns?
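The multiplicative blow-up is easy to quantify (a quick arithmetic illustration of the worst case described above, not a claim about the patch's actual memory use):

```python
# Number of distinct array-key sets = product of the number of
# constants per column, not their sum.
from math import prod

conditions_per_column = [2] * 8      # a..h, each with IN (1, 2)
assert prod(conditions_per_column) == 256      # 2^8 key sets

conditions_per_column = [4] * 8      # four constants per column
assert prod(conditions_per_column) == 65536    # 4^8 key sets
```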

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)

#3 Peter Geoghegan <pg@bowt.ie>
In reply to: Matthias van de Meent (#2)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

Considering that it caches/reuses the page across SAOP operations, can
(or does) this also improve performance for index scans on the outer
side of a join if the order of join columns matches the order of the
index?

It doesn't really cache leaf pages at all. What it does is advance the
array keys locally, while the original buffer lock is still held on
that same page.

That is, I believe this caches (leaf) pages across scan keys, but can
(or does) it also reuse these already-cached leaf pages across
restarts of the index scan/across multiple index lookups in the same
plan node, so that retrieval of nearby index values does not need to
do an index traversal?

I'm not sure what you mean. There is no reason why you need to do more
than one single descent of an index to scan many leaf pages using many
distinct sets of array keys. Obviously, this depends on being able to
observe that we really don't need to redescend the index to advance
the array keys, again and again. Note in particular that this
usually works across leaf pages.

I'm not sure I understand. MDAM seems to work on an index level to
return full ranges of values, while "skip scan" seems to try to allow
systems to signal to the index to skip to some other index condition
based on arbitrary cutoffs. This would usually be those of which the
information is not stored in the index, such as "SELECT user_id FROM
orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go
though the user_id index and skip to the next user_id value when it
gets enough rows of a matching result (where "enough" is determined
above the index AM's plan node, or otherwise is impossible to
determine with only the scan key info in the index AM). I'm not sure
how this could work without specifically adding skip scan-related
index AM functionality, and I don't see how it fits in with this
MDAM/SAOP system.

I think of that as being quite a different thing.

Basically, the patch that added that feature had to revise the index
AM API, in order to support a mode of operation where scans return
groupings rather than tuples. Whereas this patch requires none of
that. It makes affected index scans as similar as possible to
conventional index scans.

[...]

Thoughts?

MDAM seems to require exponential storage for "scan key operations"
for conditions on N columns (to be precise, the product of the number
of distinct conditions on each column); e.g. an index on mytable
(a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
AND h IN (1, 2)" would require 2^8 entries.

Note that I haven't actually changed anything about the way that the
state machine generates new sets of single value predicates -- it's
still just cycling through each distinct set of array keys in the
patch.

What you describe is a problem in theory, but I doubt that it's a
problem in practice. You don't actually have to materialize the
predicates up-front, or at all. Plus you can skip over them using the
next index tuple. So skipping works both ways.
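The "skipping works both ways" point can be sketched as a binary search over the (sorted, deduplicated) array constants (a hypothetical simplification; the real code advances full sets of array keys, not single values):

```python
# Rather than stepping through every key set one by one, the state
# machine can advance straight to the first array key >= the next
# index tuple, and the scan can likewise skip to the first tuple >=
# the current key set.
from bisect import bisect_left

def advance_keys_to_tuple(sorted_keys, index_tuple):
    """Return the first array key >= index_tuple, or None if exhausted."""
    i = bisect_left(sorted_keys, index_tuple)
    return sorted_keys[i] if i < len(sorted_keys) else None

# 1000 array constants, but the next tuple on the page is 700: one
# binary search skips past ~350 non-matching key sets at once.
keys = list(range(0, 2000, 2))        # 0, 2, 4, ..., 1998
assert advance_keys_to_tuple(keys, 700) == 700
assert advance_keys_to_tuple(keys, 701) == 702
assert advance_keys_to_tuple(keys, 9999) is None
```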

--
Peter Geoghegan

#4 Matthias van de Meent <boekewurm+postgres@gmail.com>
In reply to: Peter Geoghegan (#3)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Wed, 26 Jul 2023 at 15:42, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

Considering that it caches/reuses the page across SAOP operations, can
(or does) this also improve performance for index scans on the outer
side of a join if the order of join columns matches the order of the
index?

It doesn't really cache leaf pages at all. What it does is advance the
array keys locally, while the original buffer lock is still held on
that same page.

Hmm, then I had a mistaken understanding of what we do in _bt_readpage
with _bt_saveitem.

That is, I believe this caches (leaf) pages across scan keys, but can
(or does) it also reuse these already-cached leaf pages across
restarts of the index scan/across multiple index lookups in the same
plan node, so that retrieval of nearby index values does not need to
do an index traversal?

I'm not sure what you mean. There is no reason why you need to do more
than one single descent of an index to scan many leaf pages using many
distinct sets of array keys. Obviously, this depends on being able to
observe that we really don't need to redescend the index to advance
the array keys, again and again. Note in particular that this
usually works across leaf pages.

In a NestedLoop(inner=seqscan, outer=indexscan), the index gets
repeatedly scanned from the root, right? It seems that right now, we
copy matching index entries into a local cache (that is deleted on
amrescan), then we drop our locks and pins on the buffer, and then
start returning values from our local cache (in _bt_saveitem).
We could cache the last accessed leaf page across amrescan operations
to reduce the number of index traversals needed when the join key of
the left side is highly (but not necessarily strictly) correlated.
The worst case overhead of this would be 2 _bt_compares (to check if
the value is supposed to be fully located on the cached leaf page)
plus one memcpy( , , BLCKSZ) in the previous loop. With some smart
heuristics (e.g. page fill factor, number of distinct values, and
whether we previously hit this same leaf page in the previous scan of
this Node) we can probably also reduce this overhead to a minimum if
the joined keys are not correllated, but accellerate the query
significantly when we find out they are correllated.

Of course, in the cases where we'd expect very few distinct join keys
the planner would likely put a Memoize node above the index scan, but
for mostly unique join keys I think this could save significant
amounts of time, if only on buffer pinning and locking.

I guess I'll try to code something up when I have the time, as it
sounds not quite exactly related to your patch but an interesting
improvement nonetheless.

Kind regards,

Matthias van de Meent

#5 Peter Geoghegan <pg@bowt.ie>
In reply to: Matthias van de Meent (#4)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Wed, Jul 26, 2023 at 12:07 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

We could cache the last accessed leaf page across amrescan operations
to reduce the number of index traversals needed when the join key of
the left side is highly (but not necessarily strictly) correlated.

That sounds like block nested loop join. It's possible that that could
reuse some infrastructure from this patch, but I'm not sure.

In general, SAOP execution/MDAM performs "duplicate elimination before
it reads the data" by sorting and deduplicating the arrays up front.
While my patch sometimes elides a primitive index scan, primitive
index scans are already disjuncts that are combined to create what can
be considered one big index scan (that's how the planner and executor
think of them). The patch takes that one step further by recognizing
that it could quite literally be one big index scan in some cases (or
fewer, larger scans, at least). It's a natural incremental
improvement, as opposed to inventing a new kind of index scan. If
anything the patch makes SAOP execution more similar to traditional
index scans, especially when costing them.

Like InnoDB style loose index scan (for DISTINCT and GROUP BY
optimization), block nested loop join would require inventing a new
type of index scan. Both of these other two optimizations involve the
use of semantic information that spans multiple levels of abstraction.
Loose scan requires duplicate elimination (that's the whole point),
while IIUC block nested loop join needs to "simulate multiple inner
index scans" by deliberately returning duplicates for each would-be
inner index scan. These are specialized things.

To be clear, I think that all of these ideas are reasonable. I just
find it useful to classify these sorts of techniques according to
whether or not the index AM API would have to change or not, and the
general nature of any required changes. MDAM can do a lot of cool
things without requiring any revisions to the index AM API, which
should allow it to play nice with everything else (index path clause
safety issues notwithstanding).

--
Peter Geoghegan

#6 Matthias van de Meent <boekewurm+postgres@gmail.com>
In reply to: Peter Geoghegan (#3)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Wed, 26 Jul 2023 at 15:42, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent

I'm not sure I understand. MDAM seems to work on an index level to
return full ranges of values, while "skip scan" seems to try to allow
systems to signal to the index to skip to some other index condition
based on arbitrary cutoffs. These cutoffs would usually be ones for
which the information is not stored in the index, such as "SELECT
user_id FROM orders GROUP BY user_id HAVING COUNT(*) > 10", where the
scan would go through the user_id index and skip to the next user_id
value when it
gets enough rows of a matching result (where "enough" is determined
above the index AM's plan node, or otherwise is impossible to
determine with only the scan key info in the index AM). I'm not sure
how this could work without specifically adding skip scan-related
index AM functionality, and I don't see how it fits in with this
MDAM/SAOP system.

I think of that as being quite a different thing.

Basically, the patch that added that feature had to revise the index
AM API, in order to support a mode of operation where scans return
groupings rather than tuples. Whereas this patch requires none of
that. It makes affected index scans as similar as possible to
conventional index scans.

Hmm, yes. I see now where my confusion started. You called it out in
your first paragraph of the original mail, too, but that didn't help
me then:

The wiki does not distinguish "Index Skip Scans" and "Loose Index
Scans", but these are not the same.

In the one page on "Loose indexscan", it refers to MySQL's "loose
index scan" documentation, which does handle groupings, and this was
targeted with the previous, mislabeled, "Index skipscan" patchset.
However, crucially, it also refers to other databases' Index Skip Scan
documentation, which document and implement this approach of 'skipping
to the next potential key range to get efficient non-prefix qual
results', giving me a false impression that those two features are one
and the same when they are not.

It seems like I'll have to wait a bit longer for the functionality of
Loose Index Scans.

[...]

Thoughts?

MDAM seems to require exponential storage for "scan key operations"
for conditions on N columns (to be precise, the product of the number
of distinct conditions on each column); e.g. an index on mytable
(a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
AND h IN (1, 2)" would require 2^8 entries.

Note that I haven't actually changed anything about the way that the
state machine generates new sets of single value predicates -- it's
still just cycling through each distinct set of array keys in the
patch.

What you describe is a problem in theory, but I doubt that it's a
problem in practice. You don't actually have to materialize the
predicates up-front, or at all.

Yes, that's why I asked: The MDAM paper's examples seem to materialize
the full predicate up-front, which would require a product of all
indexed columns' quals in size, so that materialization has a good
chance to get really, really large. But if we're not doing that
materialization up front, then there is no issue with resource
consumption (except CPU time, which can likely be improved with other
methods).
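To make the "no up-front materialization" point concrete, here is a minimal, hypothetical sketch (invented names, not code from the patch) of odometer-style enumeration: the full cartesian product of array keys is visited lazily, holding only one cursor per column of state:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch (not the patch's actual code): advance a set of
 * per-column array-key cursors "odometer style", so that the cartesian
 * product of array keys is enumerated lazily with O(ncols) state rather
 * than materialized up front.  Returns false once the product is
 * exhausted. */
bool
advance_array_keys(int *cur, const int *nkeys, int ncols)
{
    for (int col = ncols - 1; col >= 0; col--)
    {
        if (++cur[col] < nkeys[col])
            return true;        /* advanced this column; done */
        cur[col] = 0;           /* wrapped; carry into next column */
    }
    return false;               /* every combination already emitted */
}

/* Count the combinations reachable from the all-zeroes start state. */
int
count_combinations(const int *nkeys, int ncols, int *cur)
{
    int count = 1;

    for (int col = 0; col < ncols; col++)
        cur[col] = 0;
    while (advance_array_keys(cur, nkeys, ncols))
        count++;
    return count;
}
```

For the eight-column "a IN (1, 2) AND ... AND h IN (1, 2)" example above, this visits all 2^8 = 256 combinations while never holding more than eight integers of cursor state.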

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)

#7Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Peter Geoghegan (#5)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Thu, 27 Jul 2023 at 06:14, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Jul 26, 2023 at 12:07 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

We could cache the last accessed leaf page across amrescan operations
to reduce the number of index traversals needed when the join key of
the left side is highly (but not necessarily strictly) correlated.

That sounds like block nested loop join. It's possible that that could
reuse some infrastructure from this patch, but I'm not sure.

My idea is not quite block nested loop join. It's more 'restart the
index scan at the location the previous index scan ended, if
heuristics say there's a good chance that might save us time'. I'd say
it is comparable to the fast tree descent optimization that we have
for endpoint queries, and comparable to this patch's scankey
optimization, but across AM-level rescans instead of internal rescans.

See also the attached prototype and loosely coded patch. It passes
tests, but it might not be without bugs.

The basic design of that patch is this: We keep track of how many
times we've rescanned, and the end location of the index scan. If a
new index scan hits the same page after _bt_search as the previous
scan ended, we register that. Those two values - num_rescans and
num_samepage - are used as heuristics for the following:

If 50% or more of rescans hit the same page as the end location of the
previous scan, we start saving the scan's end location's buffer into
the BTScanOpaque, so that the next _bt_first can check whether that
page might be the right leaf page, and if so, immediately go to that
buffer instead of descending the tree - saving one tree descent in the
process.

Further optimizations of this mechanism could easily be implemented by
e.g. only copying the min/max index tuples instead of the full index
page, reducing the overhead at scan end.
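The heuristic described above might be sketched like this (hypothetical names and thresholds as described, not the prototype's actual code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the rescan heuristic: track how often a
 * rescan's _bt_search lands on the page where the previous scan ended,
 * and start caching the scan-end page once 50% or more of rescans have
 * done so. */
typedef struct RescanStats
{
    int num_rescans;    /* rescans observed so far */
    int num_samepage;   /* rescans that landed on the previous end page */
} RescanStats;

void
record_rescan(RescanStats *stats, bool hit_same_page)
{
    stats->num_rescans++;
    if (hit_same_page)
        stats->num_samepage++;
}

/* Should the scan save its end location's page for the next rescan? */
bool
should_cache_end_page(const RescanStats *stats)
{
    return stats->num_rescans > 0 &&
        stats->num_samepage * 2 >= stats->num_rescans;
}
```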

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachments:

v1-0001-Cache-btree-scan-end-page-across-rescans-in-the-s.patch.cfbot-ignore (application/octet-stream, +141 −7)
#8Peter Geoghegan
pg@bowt.ie
In reply to: Matthias van de Meent (#6)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Thu, Jul 27, 2023 at 7:59 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

Basically, the patch that added that feature had to revise the index
AM API, in order to support a mode of operation where scans return
groupings rather than tuples. Whereas this patch requires none of
that. It makes affected index scans as similar as possible to
conventional index scans.

Hmm, yes. I see now where my confusion started. You called it out in
your first paragraph of the original mail, too, but that didn't help
me then:

The wiki does not distinguish "Index Skip Scans" and "Loose Index
Scans", but these are not the same.

A lot of people (myself included) were confused on this point for
quite a while. To make matters even more confusing, one of the really
compelling cases for the MDAM design is scans that feed into
GroupAggregates -- preserving index sort order for naturally big index
scans will tend to enable it. One of my examples from the start of
this thread showed just that. (It just so happened that that example
was faster because of all the "skipping" that nbtree *wasn't* doing
with the patch.)

Yes, that's why I asked: The MDAM paper's examples seem to materialize
the full predicate up-front, which would require a product of all
indexed columns' quals in size, so that materialization has a good
chance to get really, really large. But if we're not doing that
materialization upfront, then there is no issue with resource
consumption (except CPU time, which can likely be improved with other
methods)

I get why you asked. I might have asked the same question.

As I said, the MDAM paper has *surprisingly* little to say about
B-Tree executor stuff -- it's almost all just describing the
preprocessing/transformation process. It seems as if optimizations
like the one from my patch were considered too obvious to talk about
and/or out of scope by the authors. Thinking about the MDAM paper like
that was what made everything fall into place for me. Remember,
"missing key predicates" isn't all that special.

--
Peter Geoghegan

#9Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Peter Geoghegan (#8)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Thu, 27 Jul 2023 at 16:01, Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Jul 27, 2023 at 7:59 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

Basically, the patch that added that feature had to revise the index
AM API, in order to support a mode of operation where scans return
groupings rather than tuples. Whereas this patch requires none of
that. It makes affected index scans as similar as possible to
conventional index scans.

Hmm, yes. I see now where my confusion started. You called it out in
your first paragraph of the original mail, too, but that didn't help
me then:

The wiki does not distinguish "Index Skip Scans" and "Loose Index
Scans", but these are not the same.

A lot of people (myself included) were confused on this point for
quite a while.

I've taken the liberty of updating the "Loose indexscan" wiki page
[0], adding detail that Loose indexscans are distinct from Skip scans,
and showing some high-level distinguishing properties.
I also split the TODO entry for `` "loose" or "skip" scan `` into two,
and added links to the relevant recent threads so that it's clear
these are different (and that some previous efforts may have had a
confusing name).

I hope this will reduce the chance of future confusion between the two
different approaches to improving index scan performance.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0]: https://wiki.postgresql.org/wiki/Loose_indexscan

#10Alena Rybakina
lena.ribackina@yandex.ru
In reply to: Peter Geoghegan (#1)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

Hi, all!

CNF -> DNF conversion
=====================

Like many great papers, the MDAM paper takes one core idea, and finds
ways to leverage it to the hilt. Here the core idea is to take
predicates in conjunctive normal form (an "AND of ORs"), and convert
them into disjunctive normal form (an "OR of ANDs"). DNF quals are
logically equivalent to CNF quals, but ideally suited to SAOP-array
style processing by an ordered B-Tree index scan -- they reduce
everything to a series of non-overlapping primitive index scans, that
can be processed in keyspace order. We already do this today in the
case of SAOPs, in effect. The nbtree "next array keys" state machine
already materializes values that can be seen as MDAM style DNF single
value predicates. The state machine works by outputting the cartesian
product of each array as a multi-column index is scanned, but that
could be taken a lot further in the future. We can use essentially the
same kind of state machine to do everything described in the paper --
ultimately, it just needs to output a list of disjuncts, like the DNF
clauses that the paper shows in "Table 3".

In theory, anything can be supported via a sufficiently complete CNF
-> DNF conversion framework. There will likely always be the potential
for unsafe/unsupported clauses and/or types in an extensible system
like Postgres, though. So we will probably need to retain some notion
of safety. It seems like more of a job for nbtree preprocessing (or
some suitably index-AM-agnostic version of the same idea) than the
optimizer, in any case. But that's not entirely true, either (that
would be far too easy).

The optimizer still needs to optimize. It can't very well do that
without having some kind of advanced notice of what is and is not
supported by the index AM. And, the index AM cannot just unilaterally
decide that index quals actually should be treated as filter/qpquals,
after all -- it doesn't get a veto. So there is a mutual dependency
that needs to be resolved. I suspect that there needs to be a two way
conversation between the optimizer and nbtree code to break the
dependency -- a callback that does some of the preprocessing work
during planning. Tom said something along the same lines in passing,
when discussing the MDAM paper last year [2]. Much work remains here.

Honestly, I'm just reading and delving into this thread and other topics
related to it, so excuse me if I ask you a few obvious questions.

I noticed that you are going to add the CNF->DNF transformation at the
index construction stage. If I understand correctly, you will rewrite
the restrictinfo node, changing boolean "AND" expressions to "OR"
expressions, but would it be possible to apply such a procedure
earlier? Otherwise, I suppose you could face the problem of incorrect
selectivity estimation and, consequently, incorrect cardinality
estimation.
I can't clearly understand at what stage it becomes clear that such a
transformation needs to be applied.

--
Regards,
Alena Rybakina
Postgres Professional

#11Peter Geoghegan
pg@bowt.ie
In reply to: Matthias van de Meent (#7)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Thu, Jul 27, 2023 at 10:00 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

My idea is not quite block nested loop join. It's more 'restart the
index scan at the location the previous index scan ended, if
heuristics say there's a good chance that might save us time'. I'd say
it is comparable to the fast tree descent optimization that we have
for endpoint queries, and comparable to this patch's scankey
optimization, but across AM-level rescans instead of internal rescans.

Yeah, I see what you mean. Seems related, even though what you've
shown in your prototype patch doesn't seem like it fits into my
taxonomy very neatly.

(BTW, I was a little confused by the use of the term "endpoint" at
first, since there is a function that uses that term to refer to a
descent of the tree that happens without any insertion scan key. This
path is used whenever the best we can do in _bt_first is to descend to
the rightmost or leftmost page.)

The basic design of that patch is this: We keep track of how many
times we've rescanned, and the end location of the index scan. If a
new index scan hits the same page after _bt_search as the previous
scan ended, we register that.

I can see one advantage that block nested loop join would retain here:
it does block-based accesses on both sides of the join. Since it
"looks ahead" on both sides of the join, more repeat accesses are
likely to be avoided.

Not too sure how much that matters in practice, though.

--
Peter Geoghegan

#12Peter Geoghegan
pg@bowt.ie
In reply to: Alena Rybakina (#10)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Mon, Jul 31, 2023 at 12:24 PM Alena Rybakina
<lena.ribackina@yandex.ru> wrote:

I noticed that you are going to add CNF->DNF transformation at the index
construction stage. If I understand correctly, you will rewrite
restrictinfo node,
change boolean "AND" expressions to "OR" expressions, but would it be
possible to apply such a procedure earlier?

Sort of. I haven't really added any new CNF->DNF transformations. The
code you're talking about is really just checking that every index
path has clauses that we know that nbtree can handle. That's a big,
ugly modularity violation -- many of these details are quite specific
to the nbtree index AM (in theory we could have other index AMs that
are amsearcharray).

At most, v1 of the patch makes greater use of an existing
transformation that takes place in the nbtree index AM, as it
preprocesses scan keys for these types of queries (it's not inventing
new transformations at all). This is a slightly creative
interpretation, too. Tom's commit 9e8da0f7 didn't actually say
anything about CNF/DNF.

Otherwise I suppose you
could face the problem of
incorrect selectivity of the calculation and, consequently, the
cardinality calculation?

I can't think of any reason why that should happen as a direct result
of what I have done here. Multi-column index paths + multiple SAOP
clauses are not a new thing. The number of rows returned does not
depend on whether we have some columns as filter quals or not.

Of course that doesn't mean that the costing has no problems. The
costing definitely has several problems right now.

It also isn't necessarily okay that it's "just as good as before" if
it turns out that it needs to be better now. But I don't see why it
would be. (Actually, my hope is that selectivity estimation might be
*less* important as a practical matter with the patch.)

I can't clearly understand at what stage it is clear that the such a
transformation needs to be applied?

I don't know either.

I think that most of this work needs to take place in the nbtree code,
during preprocessing. But it's not so simple. There is a mutual
dependency between the code that generates index paths in the planner
and nbtree scan key preprocessing. The planner needs to know what
kinds of index paths are possible/safe up-front, so that it can choose
the fastest plan (the fastest that the index AM knows how to execute
correctly). But, there are lots of small annoying nbtree
implementation details that might matter, and can change.

I think we need to have nbtree register a callback, so that the
planner can initialize some preprocessing early. I think that we
require a "two way conversation" between the planner and the index AM.

--
Peter Geoghegan

#13Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#3)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Wed, Jul 26, 2023 at 6:41 AM Peter Geoghegan <pg@bowt.ie> wrote:

MDAM seems to require exponential storage for "scan key operations"
for conditions on N columns (to be precise, the product of the number
of distinct conditions on each column); e.g. an index on mytable
(a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
AND h IN (1, 2)" would require 2^8 entries.

What you describe is a problem in theory, but I doubt that it's a
problem in practice. You don't actually have to materialize the
predicates up-front, or at all. Plus you can skip over them using the
next index tuple. So skipping works both ways.

Attached is v2, which makes all array key advancement take place using
the "next index tuple" approach (using binary searches to find array
keys using index tuple values). This approach was necessary for fairly
mundane reasons (it limits the amount of work required while holding a
buffer lock), but it also solves quite a few other problems that I
find far more interesting.
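As a rough illustration of the "next index tuple" approach (a hypothetical sketch, not the patch's actual code), advancing the array keys amounts to a lower-bound binary search of the sorted array keys using the value just read from an index tuple:

```c
#include <assert.h>

/* Hypothetical sketch of "next index tuple" array key advancement:
 * given the value just read from an index tuple, binary search the
 * (sorted, ascending) array keys for the first key >= that value,
 * implicitly skipping any keys that the scan has already passed in
 * the index.  Returns the new array position, or nkeys when the
 * arrays are exhausted. */
int
advance_array_key(const int *keys, int nkeys, int tupval)
{
    int lo = 0,
        hi = nkeys;

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (keys[mid] < tupval)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;  /* first key >= tupval; caller continues or repositions */
}
```

Because the search is driven by the tuple value, many array keys can be skipped in one step; that is the sense in which "skipping works both ways".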

It's easy to imagine the state machine from v2 of the patch being
extended for skip scan. My approach "abstracts away" the arrays. For
skip scan, it would more or less behave as if the user had written a
query "WHERE a in (<Every possible value for this column>) AND b = 5
... " -- without actually knowing what the so-called array keys for
the high-order skipped column are (not up front, at least). We'd only
need to track the current "array key" for the scan key on the skipped
column, "a". The state machine would notice when the scan had reached
the next-greatest "a" value in the index (whatever that might be), and
then make that the current value. Finally, the state machine would
effectively instruct its caller to consider repositioning the scan via
a new descent of the index. In other words, almost everything for skip
scan would work just like regular SAOPs -- and any differences would
be well encapsulated.
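A toy model of that skip scan behaviour (hypothetical code with invented names; the real patch would operate on B-Tree pages, not flat arrays):

```c
#include <assert.h>

/* Hypothetical sketch: scan a sorted (a, b) "index" as if column "a"
 * had the array keys "every possible value of a".  When the scan
 * reaches the next-greatest "a", that becomes the current implicit
 * array key; once no further "b = bval" matches are possible within
 * the current "a" group, we skip ahead to the next group (where the
 * real patch would consider a fresh descent of the tree). */
typedef struct
{
    int a;
    int b;
} IndexTuple;

int
skip_scan_count(const IndexTuple *tuples, int ntuples, int bval)
{
    int matches = 0;
    int i = 0;

    while (i < ntuples)
    {
        int cur_a = tuples[i].a;    /* current implicit "array key" */

        /* scan within this "a" group; tuples are sorted on (a, b) */
        while (i < ntuples && tuples[i].a == cur_a && tuples[i].b <= bval)
        {
            if (tuples[i].b == bval)
                matches++;
            i++;
        }
        /* skip the remainder of the group: no more matches possible */
        while (i < ntuples && tuples[i].a == cur_a)
            i++;
    }
    return matches;
}
```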

But it's not just skip scan. This approach also enables thinking of
SAOP index scans (using nbtree) as just another type of indexable
clause, without any special restrictions (compared to true indexable
operators such as "=", say). Particularly in the planner. That was
always the general thrust of teaching nbtree about SAOPs, from the
start. But it's something that should be totally embraced IMV. That's
just what the patch proposes to do.

In particular, the patch now:

1. Entirely removes the long-standing restriction on generating path
keys for index paths with SAOPs, even when there are inequalities on a
high order column present. You can mix SAOPs together with other
clause types, arbitrarily, and everything still works and works
efficiently.

For example, the regression test expected output for this query/test
(from bugfix commit 807a40c5) is updated by the patch, as shown here:

 explain (costs off)
 SELECT thousand, tenthous FROM tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)
 ORDER BY thousand;
-                                      QUERY PLAN
---------------------------------------------------------------------------------------
- Sort
-   Sort Key: thousand
-   ->  Index Scan using tenk1_thous_tenthous on tenk1
-         Index Cond: ((thousand < 2) AND (tenthous = ANY
('{1001,3000}'::integer[])))
-(4 rows)
+                                   QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+   Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)

We don't need a sort node anymore -- even though the leading column
here (thousand) uses an inequality, a particularly tricky case. Now
it's an index scan, much like any other, with no particular
restrictions caused by using a SAOP.

2. Adds an nbtree strategy for non-required equality array scan keys,
which is built on the same state machine, with only minor differences
to deal with column values "appearing out of key space order".

3. Simplifies the optimizer side of things by consistently avoiding
filter quals (except when it's truly unavoidable). The optimizer
doesn't even consider alternative index paths with filter quals for
lower-order SAOP columns, because they have no possible advantage
anymore. On the other hand, as we saw already, upthread, filter quals
have huge disadvantages. By always using true index quals, we
automatically avoid any question of getting excessive amounts of heap
page accesses just to eliminate non-matching rows. AFAICT we don't
need to make a trade-off here.

The first version of the patch added some crufty code to the
optimizer, to account for various restrictions on sort order. This
revised version actually removes existing cruft from the same place
(indxpath.c) instead.

Items 1, 2, and 3 are all closely related. Take the query I've shown
for item 1. Bugfix commit 807a40c5 (which added the test query in
question) dealt with an oversight in the then-recent original nbtree
SAOP patch (commit 9e8da0f7): when nbtree combines two primitive index
scans with an inequality on their leading column, we cannot be sure
that the output will appear in the same order as the order that one
big continuous index scan returns rows in. We can only expect to
maintain the illusion that we're doing one continuous index scan when
individual primitive index scans access earlier columns via the
equality strategy -- we need "equality constraints".

In practice, the optimizer (indxpath.c) is very conservative (more
conservative than it really needs to be) when it comes to trusting the
index scan to output rows in index order, in the presence of SAOPs.
All of that now seems totally unnecessary. Again, I don't see a need
to make a trade-off here.

My observation about this query (and others like it) is: why not
literally perform one continuous index scan instead (not multiple
primitive index scans)? That is strictly better, given all the
specifics here. Once we have a way to do that (which the nbtree
executor work listed under item 2 provides), it becomes safe to assume
that the tuples will be output in index order -- there is no illusion
left to preserve. Who needs an illusion that isn't actually helping
us? We actually do less I/O by using this strategy, for the usual
reasons (we can avoid repeating index page accesses).

A more concrete benefit of the non-required-scankeys stuff can be seen
by running Benoit Tigeot's test case [1]https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4690491#gistcomment-4690491 -- Peter Geoghegan with v2. He had a query like
this:

SELECT * FROM docs
WHERE status IN ('draft', 'sent') AND
sender_reference IN ('Custom/1175', 'Client/362', 'Custom/280')
ORDER BY sent_at DESC NULLS LAST LIMIT 20;

And, his test case had an index on "sent_at DESC NULLS LAST,
sender_reference, status". This variant was a weak spot for v1.

v2 of the patch is vastly more efficient here, since we don't have to
go to the heap to eliminate non-matching tuples -- that can happen in
the index AM instead. This can easily be 2x-3x faster on a warm cache,
and have *hundreds* of times fewer buffer accesses (which Benoit
verified with an early version of this v2). All because we now require
vastly less heap access -- the quals are fairly selective here, and we
have to scan hundreds of leaf pages before the scan can terminate.
Avoiding filter quals is a huge win.

This particular improvement is hard to squarely attribute to any one
of my 3 items. The immediate problem that the query presents us with
on the master branch is the problem of filter quals that require heap
accesses to do visibility checks (a problem that index quals can never
have). That makes it tempting to credit my item 3. But you can't
really have item 3 without also having items 1 and 2. Taken together,
they eliminate all possible downsides from using index quals.

That high level direction (try to have one good choice for the
optimizer) seems important to me. Both for this project, and in
general.

Other changes in v2:

* Improved costing, that takes advantage of the fact that nbtree now
promises to not repeat any leaf page accesses (unless the scan is
restarted or the direction of the scan changes). This makes the worst
case far more predictable, and more related to selectivity estimation
-- you can't scan more pages than you have in the whole index. Just
like with every other sort of index scan.

* Support for parallel index scans.

The existing approach to array keys for parallel index scans has been
adapted to work with individual primitive index scans, not individual
array keys. I haven't tested this very thoroughly just yet, but it
seems to work well enough already. I think that it's important to not
have very much variation between parallel and serial index scans,
which I seem to have mostly avoided.

[1]: https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4690491#gistcomment-4690491
--
Peter Geoghegan

Attachments:

v2-0001-Enhance-nbtree-ScalarArrayOp-execution.patch (application/octet-stream, +1462 −287)
#14Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#13)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Sun, Sep 17, 2023 at 4:47 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v2, which makes all array key advancement take place using
the "next index tuple" approach (using binary searches to find array
keys using index tuple values).

Attached is v3, which fixes bitrot caused by today's bugfix commit 714780dc.

No notable changes here compared to v2.

--
Peter Geoghegan

Attachments:

v3-0001-Enhance-nbtree-ScalarArrayOp-execution.patch (application/x-patch, +1506 −310)
#15Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#14)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Thu, Sep 28, 2023 at 5:32 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Sep 17, 2023 at 4:47 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v2, which makes all array key advancement take place using
the "next index tuple" approach (using binary searches to find array
keys using index tuple values).

Attached is v3, which fixes bitrot caused by today's bugfix commit 714780dc.

Attached is v4, which applies cleanly on top of HEAD. This was needed
due to Alexander Korotkov's commit e0b1ee17, "Skip checking of scan
keys required for directional scan in B-tree".

Unfortunately I have more or less dealt with the conflicts on HEAD by
disabling the optimization from that commit, for the time being. The
commit in question is rather poorly documented, and it's not
immediately clear how to integrate it with my work. I just want to
make sure that there's a testable patch available.

--
Peter Geoghegan

Attachments:

v4-0001-Enhance-nbtree-ScalarArrayOp-execution.patch (application/octet-stream, +1516 −352)
#16Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#15)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Sun, Oct 15, 2023 at 1:50 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v4, which applies cleanly on top of HEAD. This was needed
due to Alexander Korotkov's commit e0b1ee17, "Skip checking of scan
keys required for directional scan in B-tree".

Unfortunately I have more or less dealt with the conflicts on HEAD by
disabling the optimization from that commit, for the time being.

Attached is v5, which deals with the conflict with the optimization
added by Alexander Korotkov's commit e0b1ee17 sensibly: the
optimization is now only disabled in cases with array scan keys.
(It'd be very hard to make it work with array scan keys, since an
important principle for my patch is that we can change search-type
scan keys right in the middle of any _bt_readpage() call).

v5 also fixes a longstanding open item for the patch: we no longer
call _bt_preprocess_keys() with a buffer lock held, which was a bad
idea at best, and unsafe (due to the syscache lookups within
_bt_preprocess_keys) at worst. A new, minimal version of the function
(called _bt_preprocess_keys_leafbuf) is called at the same point
instead. That change, combined with the array binary search stuff
(which was added back in v2), makes the total amount of work performed
with a buffer lock held totally reasonable in all cases. It's even
okay in extreme or adversarial cases with many millions of array keys.

Making this _bt_preprocess_keys_leafbuf approach work has a downside:
it requires that _bt_preprocess_keys be a little less aggressive about
removing redundant scan keys, in order to meet certain assumptions
held by the new _bt_preprocess_keys_leafbuf function. Essentially,
_bt_preprocess_keys must now worry about current and future array key
values when determining redundancy among scan keys -- not just the
current array key values. _bt_preprocess_keys knows nothing about
SK_SEARCHARRAY scan keys on HEAD, because on HEAD there is a strict
1:1 correspondence between the number of primitive index scans and the
number of array keys (actually, the number of distinct combinations of
array keys). Obviously that's no longer the case with the patch
(that's the whole point of the patch).

It's easiest to understand how elimination of redundant quals needs to
work in v5 by way of an example. Consider the following query:

select count(*), two, four, twenty, hundred
from
tenk1
where
two in (0, 1) and four in (1, 2, 3)
and two < 1;

Notice that "two" appears in the where clause twice. First it appears
as an SAOP, and then as an inequality. Right now, on HEAD, the
primitive index scan where the SAOP's scankey is "two = 0" renders
"two < 1" redundant. However, the subsequent primitive index scan
where "two = 1" does *not* render "two < 1" redundant. This has
implications for the mechanism in the patch, since the patch will
perform one big primitive index scan for all array constants, with
only a single _bt_preprocess_keys call at the start of its one and
only _bt_first call (but with multiple _bt_preprocess_keys_leafbuf
calls once we reach the leaf level).

The compromise that I've settled on in v5 is to teach
_bt_preprocess_keys to *never* treat "two < 1" as redundant with such
a query -- even though there is some squishy sense in which "two < 1"
is indeed still redundant (for the first SAOP key of value 0). My
approach is reasonably well targeted in that it mostly doesn't affect
queries that don't need it. But it will add cycles to some badly
written queries that wouldn't have had them in earlier Postgres
versions. I'm not entirely sure how much this matters, but my current
sense is that it doesn't matter all that much. This is the kind of
thing that is hard to test and poorly tested, so simplicity is even
more of a virtue than usual.
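The rule can be illustrated with a small, hypothetical check (invented names, not the patch's actual code): an inequality scan key can only be discarded as redundant when it is implied by *every* array key value, not just the current one:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical illustration of the redundancy rule described above:
 * a "< bound" inequality on the same column as a SAOP is only safely
 * redundant if every array key satisfies it.  With "two IN (0, 1) AND
 * two < 1", the key 0 satisfies "two < 1" but the key 1 does not, so
 * the inequality must be kept. */
bool
inequality_redundant_for_all_keys(const int *arraykeys, int nkeys,
                                  int ltbound)
{
    for (int i = 0; i < nkeys; i++)
    {
        if (!(arraykeys[i] < ltbound))
            return false;   /* some array key still needs the check */
    }
    return true;            /* "< bound" implied by every array key */
}
```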

Note that the changes to _bt_preprocess_keys in v5 *don't* affect how
we determine if the scan has contradictory quals, which is generally
more important. With contradictory quals, _bt_first can avoid reading
any data from the index. OTOH eliminating redundant quals (i.e. the
thing that v5 *does* change) merely makes evaluating index quals less
expensive by preprocessing away unneeded scan keys. In other words,
while it's possible that the approach taken by v5 will add CPU cycles
in a small number of cases, it should never result in more page
accesses.

--
Peter Geoghegan

Attachments:

v5-0001-Enhance-nbtree-ScalarArrayOp-execution.patch (application/octet-stream, +1484 -323)
#17 Matthias van de Meent <boekewurm+postgres@gmail.com>
In reply to: Peter Geoghegan (#16)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Sat, 21 Oct 2023 at 00:40, Peter Geoghegan <pg@bowt.ie> wrote:

> On Sun, Oct 15, 2023 at 1:50 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
>> Attached is v4, which applies cleanly on top of HEAD. This was needed
>> due to Alexander Korotkov's commit e0b1ee17, "Skip checking of scan
>> keys required for directional scan in B-tree".
>>
>> Unfortunately I have more or less dealt with the conflicts on HEAD by
>> disabling the optimization from that commit, for the time being.
>
> Attached is v5, which deals with the conflict with the optimization
> added by Alexander Korotkov's commit e0b1ee17 sensibly: the
> optimization is now only disabled in cases with array scan keys.
> (It'd be very hard to make it work with array scan keys, since an
> important principle for my patch is that we can change search-type
> scan keys right in the middle of any _bt_readpage() call).

I'm planning on reviewing this patch tomorrow, but in an initial scan
through the patch I noticed there's little information about how the
array keys state machine works in this new design. Do you have a more
toplevel description of the full state machine used in the new design?
If not, I'll probably be able to discover my own understanding of the
mechanism used in the patch, but if there is a framework to build that
understanding on (rather than having to build it from scratch) that'd
be greatly appreciated.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#18 Peter Geoghegan <pg@bowt.ie>
In reply to: Matthias van de Meent (#17)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

> I'm planning on reviewing this patch tomorrow, but in an initial scan
> through the patch I noticed there's little information about how the
> array keys state machine works in this new design. Do you have a more
> toplevel description of the full state machine used in the new design?

This is an excellent question. You're entirely right: there isn't
enough information about the design of the state machine.

In v1 of the patch, from all the way back in July, the "state machine"
advanced in the hackiest way possible: via repeated "incremental"
advancement (using logic from the function that we call
_bt_advance_array_keys() on HEAD) in a loop -- we just kept doing that
until the function I'm now calling _bt_tuple_before_array_skeys()
eventually reported that the array keys were now sufficiently
advanced. v2 greatly improved matters by totally overhauling
_bt_advance_array_keys(): it was taught to use binary searches to
advance the array keys, with limited remaining use of "incremental"
array key advancement.

However, version 2 (and all later versions to date) have somewhat
wonky state machine transitions, in one important respect: calls to
the new _bt_advance_array_keys() won't always advance the array keys
to the maximum extent possible (possible while still getting correct
behavior, that is). There were still various complicated scenarios
involving multiple "required" array keys (SK_BT_REQFWD + SK_BT_REQBKWD
scan keys that use BTEqualStrategyNumber), where one single call to
_bt_advance_array_keys() would advance the array keys to a point that
was still < caller's tuple. AFAICT this didn't cause wrong answers to
queries (that would require failing to find a set of exactly matching
array keys where a matching set exists), but it was kludgey. It was
sloppy in roughly the same way as the approach in my v1 prototype was
sloppy (just to a lesser degree).

I should be able to post v6 later this week. My current plan is to
commit the other nbtree patch first (the backwards scan "boundary
cases" one from the ongoing CF) -- since I saw your review earlier
today. I think that you should probably wait for this v6 before
starting your review. The upcoming version will have simple
preconditions and postconditions for the function that advances the
array key state machine (the new _bt_advance_array_keys). These are
enforced by assertions at the start and end of the function. So the
rules for the state machine become crystal clear and fairly easy to
keep in your head (e.g., tuple must be >= required array keys on entry
and <= required array keys on exit, the array keys must always either
advance by one increment or be completely exhausted for the top-level
scan in the current scan direction).
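
As a rough illustration of those invariants (hypothetical Python, not the actual _bt_advance_array_keys C code), a binary-search-based advancement step with the precondition and postcondition enforced as assertions might look like:

```python
import bisect

def advance_required_array(sorted_keys, cur_idx, tuple_value):
    """Advance an ascending required array key past tuple_value with one
    binary search, instead of repeated one-element "incremental" steps.

    Returns the new array index, or None once the array keys are
    exhausted for the scan direction."""
    # Precondition: the caller's tuple must be >= the current array key.
    assert tuple_value >= sorted_keys[cur_idx]
    new_idx = bisect.bisect_left(sorted_keys, tuple_value, lo=cur_idx)
    if new_idx == len(sorted_keys):
        return None  # keys exhausted for the top-level scan
    # Postcondition: the tuple must be <= the newly selected array key.
    assert tuple_value <= sorted_keys[new_idx]
    return new_idx
```

For example, with array keys [10, 20, 500, 600] and a tuple value of 510, a single call advances from index 0 straight to index 3 (key 600), rather than stepping through 20 and 500 one at a time.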

Unsurprisingly, I found that adding and enforcing these invariants led
to a simpler and more general design within _bt_advance_array_keys.
That code is still the most complicated part of the patch, but it's
much less of a bag of tricks. Another reason for you to hold off for a
few more days.

--
Peter Geoghegan

#19 Matthias van de Meent <boekewurm+postgres@gmail.com>
In reply to: Peter Geoghegan (#18)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan <pg@bowt.ie> wrote:

> On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>
>> I'm planning on reviewing this patch tomorrow, but in an initial scan
>> through the patch I noticed there's little information about how the
>> array keys state machine works in this new design. Do you have a more
>> toplevel description of the full state machine used in the new design?
>
> This is an excellent question. You're entirely right: there isn't
> enough information about the design of the state machine.
>
> I should be able to post v6 later this week. My current plan is to
> commit the other nbtree patch first (the backwards scan "boundary
> cases" one from the ongoing CF) -- since I saw your review earlier
> today. I think that you should probably wait for this v6 before
> starting your review.

Okay, thanks for the update, then I'll wait for v6 to be posted.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#20 Peter Geoghegan <pg@bowt.ie>
In reply to: Matthias van de Meent (#19)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

On Tue, Nov 7, 2023 at 4:20 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

> On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan <pg@bowt.ie> wrote:
>
>> I should be able to post v6 later this week. My current plan is to
>> commit the other nbtree patch first (the backwards scan "boundary
>> cases" one from the ongoing CF) -- since I saw your review earlier
>> today. I think that you should probably wait for this v6 before
>> starting your review.
>
> Okay, thanks for the update, then I'll wait for v6 to be posted.

On second thought, I'll just post v6 now (there won't be conflicts
against the master branch once the other patch is committed anyway).

Highlights:

* Major simplifications to the array key state machine, already
described by my recent email.

* Added preprocessing of "redundant and contradictory" array elements
to _bt_preprocess_array_keys().

This special preprocessing pass just for array keys ("preprocessing
preprocessing") within _bt_preprocess_array_keys() turns this query
into a no-op:

select * from tab where a in (180, 345) and a in (230, 300); -- contradictory

Similarly, it can make this query attempt only a single primitive
index scan, for "230":

select * from tab where a in (180, 230) and a in (230, 300); -- has
redundancies, plus some individual elements contradict each other

This duplicates some of what _bt_preprocess_keys can do already. But
_bt_preprocess_keys can only do this stuff at the level of individual
array elements/primitive index scans. Whereas this works "one level
up", allowing preprocessing to see the full picture rather than just
seeing the start of one particular primitive index scan. It explicitly
works across array keys, saving repeat work inside
_bt_preprocess_keys. That could really add up with thousands of array
keys and/or multiple SAOPs. (Note that _bt_preprocess_array_keys
already does something like this, to deal with SAOP inequalities such
as "WHERE my_col >= any (array[1, 2])" -- it's a little surprising
that this obvious optimization wasn't part of the original nbtree SAOP
patch.)
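
The cross-array part of that preprocessing amounts to taking the set intersection of the equality arrays on the same index column, once, before the scan begins. A minimal sketch (hypothetical Python; the real _bt_preprocess_array_keys works on sorted scan-key arrays in C):

```python
def merge_equality_arrays(*arrays):
    """Merge several equality SAOP arrays on the same index column into
    one sorted, de-duplicated array: their set intersection. An empty
    result means the quals are contradictory, so the scan is a no-op."""
    merged = set(arrays[0])
    for arr in arrays[1:]:
        merged &= set(arr)  # only elements present in every array survive
    return sorted(merged)
```

With the queries above, merge_equality_arrays([180, 345], [230, 300]) yields an empty array (contradictory quals, no index access needed), while merge_equality_arrays([180, 230], [230, 300]) leaves just [230], a single primitive index scan.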

This reminds me: you might want to try breaking the patch by coming up
with adversarial cases, Matthias. The patch needs to be able to deal
with absurdly large amounts of array keys reasonably well, because it
proposes to normalize passing those to the nbtree code. It's
especially important that the patch never takes too much time to do
something (e.g., binary searching through array keys) while holding a
buffer lock -- even with very silly adversarial queries.

So, for example, queries like this one (specifically designed to
stress the implementation) *need* to work reasonably well:

with a as (
  select i from generate_series(0, 500000) i
)
select
  count(*), thousand, tenthous
from
  tenk1
where
  thousand = any (array[(select array_agg(i) from a)]) and
  tenthous = any (array[(select array_agg(i) from a)])
group by
  thousand, tenthous
order by
  thousand, tenthous;

(You can run this yourself after the regression tests finish, of course.)

This takes about 130ms on my machine, hardly any of which takes place
in the nbtree code with the patch (think tens of microseconds per
_bt_readpage call, at most) -- the plan is an index-only scan that
gets only 30 buffer hits. On the master branch, it's vastly slower --
1000025 buffer hits. The query as a whole takes about 3 seconds there.

If you have 3 or 4 SAOPs (with a composite index that has as many
columns) you can quite easily DOS the master branch, since the planner
makes a generic assumption that each of these SAOPs will have only 10
elements. The planner makes the same assumption with the patch
applied, but with one important difference: it no longer matters to
nbtree. The cost of
scanning each index page should be practically independent of the
total size of each array, at least past a certain point. Similarly,
the maximum cost of an index scan should be approximately fixed: it
should be capped at the cost of a full index scan (with the added cost
of these relatively expensive quals still capped, still essentially
independent of array sizes past some point).

I notice that if I remove the "thousand = any (array[(select
array_agg(i) from a)]) and" line from the adversarial query, executing
the resulting query still gets 30 buffer hits with the patch -- though
it only takes 90ms this time (it's faster for reasons that likely have
less to do with nbtree overheads than you'd think). This is just
another way of getting roughly the same full index scan. That's a
completely new way of thinking about nbtree SAOPs from a planner
perspective (also from a user's perspective, I suppose).

It's important that the planner's new optimistic assumptions about the
cost profile of SAOPs (that it can expect reasonable
performance/access patterns with wildly unreasonable/huge/arbitrarily
complicated SAOPs) always be met by nbtree -- no repeat index page
accesses, no holding a buffer lock for more than (say) a small
fraction of 1 millisecond (no matter the complexity of the query), and
possibly other things I haven't thought of yet.

If you end up finding a bug in this v6, it'll most likely be a case
where nbtree fails to live up to that. This project is as much about
robust/predictable performance as anything else -- nbtree needs to be
able to cope with practically anything. I suggest that your review
start by trying to break the patch along these lines.

--
Peter Geoghegan

Attachments:

v6-0001-Enhance-nbtree-ScalarArrayOp-execution.patch (application/octet-stream, +1700 -273)