Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
I've been working on a variety of improvements to nbtree's native
ScalarArrayOpExpr execution. This builds on Tom's work in commit
9e8da0f7.
Attached patch is still at the prototype stage. I'm posting it as v1 a
little earlier than I usually would because there has been much back
and forth about it on a couple of other threads involving Tomas Vondra
and Jeff Davis -- seems like it would be easier to discuss with
working code available.
The patch adds two closely related enhancements to ScalarArrayOp
execution by nbtree:
1. Execution of quals with ScalarArrayOpExpr clauses during nbtree
index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now
"advance the scan's array keys locally", which sometimes avoids
significant amounts of unneeded pinning/locking of the same set of
index pages.
SAOP index scans become capable of eliding primitive index scans for
the next set of array keys in line in cases where it isn't truly
necessary to descend the B-Tree again. Index scans are now capable of
"sticking with the existing leaf page for now" when it is determined
that the end of the current set of array keys is physically close to
the start of the next set of array keys (the next set in line to be
materialized by the _bt_advance_array_keys state machine). This is
often possible.
Naturally, we still prefer to advance the array keys in the
traditional way ("globally") much of the time. That means we'll
perform another _bt_first/_bt_search descent of the index, starting a
new primitive index scan. Whether we try to skip pages on the leaf
level or stick with the current primitive index scan (by advancing
array keys locally) is likely to vary a great deal. Even during the
same index scan. Everything is decided dynamically, which is the only
approach that really makes sense.
This optimization can significantly lower the number of buffers pinned
and locked in cases with significant locality, and/or with many array
keys with no matches. The savings (when measured in buffers
pined/locked) can be as high as 10x, 100x, or even more. Benchmarking
has shown that transaction throughput for variants of "pgbench -S"
designed to stress the implementation (hundreds of array constants)
under concurrent load can have up to 5.5x higher transaction
throughput with the patch. Less extreme cases (10 array constants,
spaced apart) see about a 20% improvement in throughput. There are
similar improvements to latency for the patch, in each case.
2. The optimizer now produces index paths with multiple SAOP clauses
(or other clauses we can safely treat as "equality constraints'') on
each of the leading columns from a composite index -- all while
preserving index ordering/useful pathkeys in most cases.
The nbtree work from item 1 is useful even with the simplest IN() list
query involving a scan of a single column index. Obviously, it's very
inefficient for the nbtree code to use 100 primitive index scans when
1 is sufficient. But that's not really why I'm pursuing this project.
My real goal is to implement (or to enable the implementation of) a
whole family of useful techniques for multi-column indexes. I call
these "MDAM techniques", after the 1995 paper "Efficient Search of
Multidimensional B-Trees" [1]http://vldb.org/conf/1995/P710.PDF[2]/messages/by-id/2587523.1647982549@sss.pgh.pa.us-- MDAM is short for "multidimensional
access method". In the context of the paper, "dimension" refers to
dimensions in a decision support system.
The most compelling cases for the patch all involve multiple index
columns with multiple SAOP clauses (especially where each column
represents a separate "dimension", in the DSS sense). It's important
that index sort be preserved whenever possible, too. Sometimes this is
directly useful (e.g., because the query has an ORDER BY), but it's
always indirectly needed, on the nbtree side (when the optimizations
are applicable at all). The new nbtree code now has special
requirements surrounding SAOP search type scan keys with composite
indexes. These requirements make changes in the optimizer all but
essential.
Index order
===========
As I said, there are cases where preserving index order is immediately
and obviously useful, in and of itself. Let's start there.
Here's a test case that you can run against the regression test database:
pg@regression:5432 =# create index order_by_saop on tenk1(two,four,twenty);
CREATE INDEX
pg@regression:5432 =# EXPLAIN (ANALYZE, BUFFERS)
select ctid, thousand from tenk1
where two in (0,1) and four in (1,2) and twenty in (1,2)
order by two, four, twenty limit 20;
With the patch, this query gets 13 buffer hits. On the master branch,
it gets 1377 buffer hits -- which exceeds the number you'll get from a
sequential scan by about 4x. No coaxing was required to get the
planner to produce this plan on the master branch. Almost all of the
savings shown here are related to heap page buffer hits -- the nbtree
changes don't directly help in this particular example (strictly
speaking, you only need the optimizer changes to get this result).
Obviously, the immediate reason why the patch wins by so much is
because it produces a plan that allows the LIMIT to terminate the scan
far sooner. Benoit Tigeot (CC'd) happened to run into this issue
organically -- that was also due to heap hits, a LIMIT, and so on. As
luck would have it, I stumbled upon his problem report (in the
Postgres slack channel) while I was working on this patch. He produced
a fairly complete test case, which was helpful [3]https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d -- Peter Geoghegan. This example is
more or less just a distillation of his test case, designed to be easy
for a Postgres hacker to try out for themselves.
There are also variants of this query where a LIMIT isn't the crucial
factor, and where index page hits are the problem. This query uses an
index-only scan, both on master and with the patch (same index as
before):
select count(*), two, four, twenty
from tenk1
where two in (0, 1) and four in (1, 2, 3, 4) and
twenty in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,14,15)
group by two, four, twenty
order by two, four, twenty;
The patch gets 18 buffer hits for this query. That outcome makes
intuitive sense, since this query is highly unselective -- it's
approaching the selectivity of the query "select count(*) from tenk1".
The simple count(*) query gets 19 buffer hits for its own index-only
scan, confirming that the patch managed to skip only one or two leaf
pages in the complicated "group by" variant of the count(*) query.
Overall, the GroupAggregate plan used by the patch is slower than the
simple count(*) case (despite touching fewer pages). But both plans
have *approximately* the same execution cost, which makes sense, since
they both have very similar selectivities.
The master branch gets 245 buffer hits for the same group by query.
This is almost as many hits as a sequential scan would require -- even
though there are precisely zero heap accesses needed by the underlying
index-only scan. As with the first example, no planner coaxing was
required to get this outcome on master. It is inherently very
difficult to predict how selective a query like this will be using
conventional statistics. But that's not actually the problem in this
example -- the planner gets that part right, on this occasion. The
real problem is that there is a multiplicative factor to worry about
on master, when executing multiple SAOPs. That makes it almost
impossible to predict the number of pages we'll pin. While with the
patch, scans with multiple SAOPs are often fairly similar to scans
that happen to just have one on the leading column.
With the patch, it is simply impossible for an SAOP index scan to
visit any single leaf page more than once. Just like a conventional
index scan. Whereas right now, on master, using more than one SAOP
clause for a multi column index seems to me to be a wildly risky
proposition. You can easily have cases that work just fine on master,
while only slight variations of the same query see costs explode
(especially likely with a LIMIT). ISTM that there is significant value
in knowing for sure in having a pretty accurate idea of the worst case
in the planner.
Giving nbtree the ability to skip or not skip dynamically, based on
actual conditions in the index (not on statistics), seems like it has
a lot of potential as a way of improving performance *stability*.
Personally I'm most interested in this aspect of the project.
Note: we can visit internal pages more than once, but that seems to
make a negligible difference to the overall cost profile of scans. Our
policy is to not charge an I/O cost for those pages. Plus, the number
of internal page access is dramatically reduced (it's just not
guaranteed that there won't be any repeat accesses for internal pages,
is all).
Note also: there are hard-to-pin-down interactions between the
immediate problem on the nbtree side, and the use of filter quals
rather than true index quals, where the use of index quals is possible
in principle. Some problematic cases see excessive amounts of heap
page hits only (as with my first example query). Other problematic
cases see excessive amounts of index page hits, with little to no
impact on heap page hits at all (as with my second example query).
Some combination of the two is also possible.
Safety
======
As mentioned already, the ability to "advance the current set of array
keys locally" during a scan (the nbtree work in item 1) actually
relies the optimizer work in item 2 -- it's not just a question of
unlocking the potential of the nbtree work. Now I'll discuss those
aspects in a bit more detail.
Without the optimizer work, nbtree will produce wrong answers to
queries, in a way that resembles the complaint addressed by historical
bugfix commit 807a40c5. This incorrect behavior (if the optimizer were
to permit it) would only be seen when there are multiple
arrays/columns, and an inequality on a leading column -- just like
with that historical bug. (It works both ways, though -- the nbtree
changes also make the optimizer changes safe by limiting the worst
case, which would otherwise be too much of a risk to countenance. You
can't separate one from the other.)
The primary change on the optimizer side is the addition of logic to
differentiate between the following two cases when building an index
path in indxpath.c:
* Unsafe: Cases where it's fundamentally unsafe to treat
multi-column-with-SAOP-clause index paths as returning tuples in a
useful sort order.
For example, the test case committed as part of that bugfix involves
an inequality, so it continues to be treated as unsafe.
* Safe: Cases where (at least in theory) bugfix commit 807a40c5 went
further than it really had to.
Those cases get to use the optimization, and usually get to have
useful path keys.
My optimizer changes are very kludgey. I came up with various ad-hoc
rules to distinguish between the safe and unsafe cases, without ever
really placing those changes into some kind of larger framework. That
was enough to validate the general approach in nbtree, but it
certainly has problems -- glaring problems. The biggest problem of all
may be my whole "safe vs unsafe" framing itself. I know that many of
the ostensibly unsafe cases are in fact safe (with the right
infrastructure in place), because the MDAM paper says just that. The
optimizer can't support inequalities right now, but the paper
describes how to support "NOT IN( )" lists -- clearly an inequality!
The current ad-hoc rules are at best incomplete, and at worst are
addressing the problem in fundamentally the wrong way.
CNF -> DNF conversion
=====================
Like many great papers, the MDAM paper takes one core idea, and finds
ways to leverage it to the hilt. Here the core idea is to take
predicates in conjunctive normal form (an "AND of ORs"), and convert
them into disjunctive normal form (an "OR of ANDs"). DNF quals are
logically equivalent to CNF quals, but ideally suited to SAOP-array
style processing by an ordered B-Tree index scan -- they reduce
everything to a series of non-overlapping primitive index scans, that
can be processed in keyspace order. We already do this today in the
case of SAOPs, in effect. The nbtree "next array keys" state machine
already materializes values that can be seen as MDAM style DNF single
value predicates. The state machine works by outputting the cartesian
product of each array as a multi-column index is scanned, but that
could be taken a lot further in the future. We can use essentially the
same kind of state machine to do everything described in the paper --
ultimately, it just needs to output a list of disjuncts, like the DNF
clauses that the paper shows in "Table 3".
In theory, anything can be supported via a sufficiently complete CNF
-> DNF conversion framework. There will likely always be the potential
for unsafe/unsupported clauses and/or types in an extensible system
like Postgres, though. So we will probably need to retain some notion
of safety. It seems like more of a job for nbtree preprocessing (or
some suitably index-AM-agnostic version of the same idea) than the
optimizer, in any case. But that's not entirely true, either (that
would be far too easy).
The optimizer still needs to optimize. It can't very well do that
without having some kind of advanced notice of what is and is not
supported by the index AM. And, the index AM cannot just unilaterally
decide that index quals actually should be treated as filter/qpquals,
after all -- it doesn't get a veto. So there is a mutual dependency
that needs to be resolved. I suspect that there needs to be a two way
conversation between the optimizer and nbtree code to break the
dependency -- a callback that does some of the preprocessing work
during planning. Tom said something along the same lines in passing,
when discussing the MDAM paper last year [2]/messages/by-id/2587523.1647982549@sss.pgh.pa.us. Much work remains here.
Skip Scan
=========
MDAM encompasses something that people tend to call "skip scan" --
terminology with a great deal of baggage. These days I prefer to call
it "filling in missing key predicates", per the paper. That's much
more descriptive, and makes it less likely that people will conflate
the techniques with InnoDB style "loose Index scans" -- the latter is
a much more specialized/targeted optimization. (I now believe that
these are very different things, though I was thrown off by the
superficial similarities for a long time. It's pretty confusing.)
I see this work as a key enabler of "filling in missing key
predicates". MDAM describes how to implement this technique by
applying the same principles that it applies everywhere else: it
proposes a scheme that converts predicates from CNF to DNF. With just
a little extra logic required to do index probes to feed the
DNF-generating state machine, on demand.
More concretely, in Postgres terms: skip scan can be implemented by
inventing a new placeholder clause that can be composed alongside
ScalarArrayOpExprs, in the same way that multiple ScalarArrayOpExprs
can be composed together in the patch already. I'm thinking of a type
of clause that makes the nbtree code materialize a set of "array keys"
for a SK_SEARCHARRAY scan key dynamically, via ad-hoc index probes
(perhaps static approaches would be better for types like boolean,
which the paper contemplates). It should be possible to teach the
_bt_advance_array_keys state machine to generate those values in
approximately the same fashion as it already does for
ScalarArrayOpExprs -- and, it shouldn't be too hard to do it in a
localized fashion, allowing everything else to continue to work in the
same way without any special concern. This separation of concerns is a
nice consequence of the way that the MDAM design really leverages
preprocessing/DNF for everything.
Both types of clauses can be treated as part of a general class of
ScalarArrayOpExpr-like clauses. Making the rules around
"composability" simple will be important.
Although skip scan gets a lot of attention, it's not necessarily the
most compelling MDAM technique. It's also not especially challenging
to implement on top of everything else. It really isn't that special.
Right now I'm focussed on the big picture, in any case. I want to
emphasize the very general nature of these techniques. Although I'm
focussed on SOAPs in the short term, many queries that don't make use
of SAOPs should ultimately see similar benefits. For example, the
paper also describes transformations that apply to BETWEEN/range
predicates. We might end up needing a third type of expression for
those. They're all just DNF single value predicates, under the hood.
Thoughts?
[1]: http://vldb.org/conf/1995/P710.PDF
[2]: /messages/by-id/2587523.1647982549@sss.pgh.pa.us
[3]: https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d -- Peter Geoghegan
--
Peter Geoghegan
Attachments:
v1-0001-Enhance-nbtree-ScalarArrayOp-execution.patchapplication/octet-stream; name=v1-0001-Enhance-nbtree-ScalarArrayOp-execution.patchDownload
From d4459fe464d41bdd3fa5e81b310b095560f4f5b0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 17 Jun 2023 17:03:36 -0700
Subject: [PATCH v1] Enhance nbtree ScalarArrayOp execution.
Teach nbtree to avoid primitive index when executing a scan with
ScalarArrayOp keys.
---
src/include/access/nbtree.h | 46 +-
src/backend/access/nbtree/nbtree.c | 21 +-
src/backend/access/nbtree/nbtsearch.c | 85 +++-
src/backend/access/nbtree/nbtutils.c | 589 +++++++++++++++++++++++++-
src/backend/optimizer/path/indxpath.c | 206 +++++++--
src/backend/utils/adt/selfuncs.c | 56 ++-
6 files changed, 919 insertions(+), 84 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 8891fa797..5935dbc86 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1034,6 +1034,42 @@ typedef struct BTArrayKeyInfo
Datum *elem_values; /* array of num_elems Datums */
} BTArrayKeyInfo;
+/*
+ * _bt_readpage state used across _bt_checkkeys calls for a page
+ */
+typedef struct BTReadPageState
+{
+ /*
+ * Input parameters set by _bt_readpage, for _bt_checkkeys.
+ *
+ * dir: scan direction
+ *
+ * highkey: page high key
+ *
+ * SK_SEARCHARRAY forward scans are required to set the page high key up
+ * front.
+ */
+ ScanDirection dir;
+ IndexTuple highkey;
+
+ /*
+ * Output parameters set by _bt_checkkeys, for _bt_readpage.
+ *
+ * continuescan: Is there a need to continue the scan beyond this tuple?
+ */
+ bool continuescan;
+
+ /*
+ * Private _bt_checkkeys state, describes caller's page.
+ *
+ * match_for_cur_array_keys: _bt_checkkeys returned true once or more?
+ *
+ * highkeychecked: Current set of array keys checked against high key?
+ */
+ bool match_for_cur_array_keys;
+ bool highkeychecked;
+} BTReadPageState;
+
typedef struct BTScanOpaqueData
{
/* these fields are set by _bt_preprocess_keys(): */
@@ -1047,7 +1083,9 @@ typedef struct BTScanOpaqueData
* there are any unsatisfiable array keys) */
int arrayKeyCount; /* count indicating number of array scan keys
* processed */
+ bool arrayKeysStarted; /* Scan still processing array keys? */
BTArrayKeyInfo *arrayKeys; /* info about each equality-type array key */
+ BTScanInsert arrayPoskey; /* initial positioning insertion scan key */
MemoryContext arrayContext; /* scan-lifespan context for array data */
/* info about killed items if any (killedItems is NULL if never used) */
@@ -1253,8 +1291,12 @@ extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
+extern void _bt_array_keys_save_scankeys(IndexScanDesc scan,
+ BTScanInsert inskey);
+extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, bool final,
+ BTReadPageState *pstate);
+extern void _bt_checkfinalkeys(IndexScanDesc scan, BTReadPageState *pstate);
+extern bool _bt_nocheckkeys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4553aaee5..7ccd5f3f3 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -363,7 +363,9 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->arrayKeyData = NULL; /* assume no array keys for now */
so->numArrayKeys = 0;
+ so->arrayKeysStarted = false;
so->arrayKeys = NULL;
+ so->arrayPoskey = NULL;
so->arrayContext = NULL;
so->killedItems = NULL; /* until needed */
@@ -404,6 +406,7 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
so->markItemIndex = -1;
so->arrayKeyCount = 0;
+ so->arrayKeysStarted = false;
BTScanPosUnpinIfPinned(so->markPos);
BTScanPosInvalidate(so->markPos);
@@ -752,7 +755,23 @@ _bt_parallel_done(IndexScanDesc scan)
* keys.
*
* Updates the count of array keys processed for both local and parallel
- * scans.
+ * scans. (XXX Really? Then why is "scan->parallel_scan != NULL" used as a
+ * gating condition by our caller?)
+ *
+ * XXX Local advancement of array keys occurs dynamically, and affects the
+ * top-level scan state. This is at odds with how parallel scans deal with
+ * array key advancement here, so for now we just don't support them at all.
+ *
+ * The issue here is that the leader instructs workers to process array keys
+ * in whatever order is convenient, without concern for repeat or concurrent
+ * accesses to the same physical leaf pages by workers. This can be addressed
+ * by assigning batches of array keys to workers. Each individual batch would
+ * match a range from the key space covered by some specific leaf page. That
+ * whole approach requires dynamic back-and-forth key space partitioning.
+ *
+ * It seems important that parallel index scans match serial index scans in
+ * promising that no single leaf page will be accessed more than once. That
+ * makes reasoning about the worst case much easier when costing scans.
*/
void
_bt_parallel_advance_array_keys(IndexScanDesc scan)
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 3230b3b89..dcf399acd 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -890,6 +890,18 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Assert(!BTScanPosIsValid(so->currPos));
+ /*
+ * XXX Queries with SAOPs have always accounted for each call here as one
+ * "index scan". This meant that the accounting showed one index scan per
+ * distinct SAOP constant. This approach is consistent with how it was
+ * done before nbtree was taught to handle ScalarArrayOpExpr quals itself
+ * (it's also how non-amsearcharray index AMs still do it).
+ *
+ * Right now, eliding a primitive index scan elides a call here, resulting
+ * in one less "index scan" recorded by pgstat. This seems defensible,
+ * though not necessarily desirable. Now implementation details can have
+ * a significant impact on user-visible index scan counts.
+ */
pgstat_count_index_scan(rel);
/*
@@ -1370,6 +1382,13 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
inskey.scantid = NULL;
inskey.keysz = keysCount;
+ /*
+ * Save insertion scan key for SK_SEARCHARRAY scans, which need it to
+ * advance the scan's array keys locally
+ */
+ if (so->numArrayKeys > 0)
+ _bt_array_keys_save_scankeys(scan, &inskey);
+
/*
* Use the manufactured insertion scan key to descend the tree and
* position ourselves on the target leaf page.
@@ -1548,9 +1567,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
BTPageOpaque opaque;
OffsetNumber minoff;
OffsetNumber maxoff;
+ BTReadPageState pstate;
int itemIndex;
- bool continuescan;
- int indnatts;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1570,8 +1588,12 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
}
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ pstate.dir = dir;
+ pstate.highkey = NULL;
+ pstate.continuescan = true; /* default assumption */
+ pstate.match_for_cur_array_keys = false;
+ pstate.highkeychecked = false;
+
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
@@ -1606,6 +1628,14 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (ScanDirectionIsForward(dir))
{
+ /* SK_SEARCHARRAY scans must provide high key up front */
+ if (so->numArrayKeys && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+
+ pstate.highkey = (IndexTuple) PageGetItem(page, iid);
+ }
+
/* load items[] in ascending order */
itemIndex = 0;
@@ -1628,7 +1658,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, iid);
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+ if (_bt_checkkeys(scan, itup, false, &pstate))
{
/* tuple passes all scan key conditions */
if (!BTreeTupleIsPosting(itup))
@@ -1661,7 +1691,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
}
}
/* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
+ if (!pstate.continuescan)
break;
offnum = OffsetNumberNext(offnum);
@@ -1678,17 +1708,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* only appear on non-pivot tuples on the right sibling page are
* common.
*/
- if (continuescan && !P_RIGHTMOST(opaque))
+ if (pstate.continuescan)
{
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
+ if (!P_RIGHTMOST(opaque) && !pstate.highkey)
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+ pstate.highkey = (IndexTuple) PageGetItem(page, iid);
+ }
+
+ _bt_checkfinalkeys(scan, &pstate);
}
- if (!continuescan)
+ if (!pstate.continuescan)
so->currPos.moreRight = false;
Assert(itemIndex <= MaxTIDsPerBTreePage);
@@ -1722,8 +1754,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
*/
if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
{
- Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
+ Assert(offnum >= minoff);
+ if (offnum > minoff)
{
offnum = OffsetNumberPrev(offnum);
continue;
@@ -1736,8 +1768,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, iid);
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
+ passes_quals = _bt_checkkeys(scan, itup, offnum == minoff,
+ &pstate);
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions */
@@ -1776,16 +1808,25 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
}
}
}
- if (!continuescan)
- {
- /* there can't be any more matches, so stop */
- so->currPos.moreLeft = false;
+ /* When !continuescan, there can't be any more matches, so stop */
+ if (!pstate.continuescan)
break;
- }
offnum = OffsetNumberPrev(offnum);
}
+ /*
+ * Backward scans never check the high key, but must still call
+ * _bt_nocheckkeys when they reach the last page (the leftmost page)
+ * without any tuple ever setting continuescan to false.
+ */
+ if (pstate.continuescan && P_LEFTMOST(opaque) &&
+ _bt_nocheckkeys(scan, dir))
+ pstate.continuescan = false;
+
+ if (!pstate.continuescan)
+ so->currPos.moreLeft = false;
+
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7da499c4d..af8accbd3 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -45,11 +45,19 @@ static int _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
bool reverse,
Datum *elems, int nelems);
static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
+static bool _bt_advance_array_keys_locally(IndexScanDesc scan,
+ IndexTuple tuple, bool final,
+ BTReadPageState *pstate);
+static bool _bt_tuple_advances_keys(IndexScanDesc scan, IndexTuple tuple,
+ ScanDirection dir);
static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
ScanKey leftarg, ScanKey rightarg,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
+static bool _bt_check_compare(ScanKey keyData, int keysz,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan);
static bool _bt_check_rowcompare(ScanKey skey,
IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
ScanDirection dir, bool *continuescan);
@@ -202,6 +210,29 @@ _bt_freestack(BTStack stack)
* array keys, it's sufficient to find the extreme element value and replace
* the whole array with that scalar value.
*
+ * It's important that we consistently avoid leaving behind SK_SEARCHARRAY
+ * inequalities after preprocessing, since _bt_advance_array_keys_locally
+ * expects to be able to treat SK_SEARCHARRAY keys as equality constraints.
+ * This makes it possible for the scan to take advantage of naturally occuring
+ * locality to avoid continually redescending the index in _bt_first. We can
+ * advance the array keys opportunistically inside _bt_check_array_keys. This
+ * won't affect the externally visible behavior of the scan.
+ *
+ * In the worst case, the number of primitive index scans will equal the
+ * number of array elements (or the product of the number of array keys when
+ * there are multiple arrays/columns involved). It's also possible that the
+ * total number of primitive index scans will be far less than that.
+ *
+ * We always sort and deduplicate arrays up-front for equality array keys.
+ * ScalarArrayOpExpr execution need only visit leaf pages that might contain
+ * matches exactly once, while preserving the sort order of the index. This
+ * isn't just about performance; it also avoids needing duplicate elimination
+ * of matching TIDs (we prefer deduplicating search keys once, up-front).
+ * Equality SK_SEARCHARRAY keys are disjuncts that we always process in
+ * index/key space order, which makes this general approach feasible. Every
+ * index tuple will match no more than one single distinct combination of
+ * equality-constrained keys (array keys and other equality keys).
+ *
* Note: the reason we need so->arrayKeyData, rather than just scribbling
* on scan->keyData, is that callers are permitted to call btrescan without
* supplying a new set of scankey data.
@@ -539,6 +570,9 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
curArrayKey->cur_elem = 0;
skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
}
+
+ /* Tell _bt_advance_array_keys to advance array keys when called */
+ so->arrayKeysStarted = true;
}
/*
@@ -546,6 +580,10 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
*
* Returns true if there is another set of values to consider, false if not.
* On true result, the scankeys are initialized with the next set of values.
+ *
+ * On false result, local advancement of the array keys has reached the end of
+ * each of the arrays for the current scan direction. Only our btgettuple and
+ * btgetbitmap callers should rely on this.
*/
bool
_bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
@@ -554,6 +592,9 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
bool found = false;
int i;
+ if (!so->arrayKeysStarted)
+ return false;
+
/*
* We must advance the last array key most quickly, since it will
* correspond to the lowest-order index column among the available
@@ -594,6 +635,10 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
break;
}
+ /* Scan reached the end of its array keys in the current scan direction */
+ if (!found)
+ so->arrayKeysStarted = false;
+
/* advance parallel scan */
if (scan->parallel_scan != NULL)
_bt_parallel_advance_array_keys(scan);
@@ -601,6 +646,391 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
return found;
}
+/*
+ * Check if we need to advance SK_SEARCHARRAY array keys when _bt_checkkeys
+ * returns false and sets continuescan=false. It's possible that the tuple
+ * will be a match after we advance the array keys.
+ *
+ * It is often possible for SK_SEARCHARRAY scans to skip one or more primitive
+ * index scans. Starting a new primitive scan is only required when it is
+ * truly necessary to reposition the top-level scan to some distant leaf page
+ * (where the start of the key space for the next set of search keys begins).
+ * This process (redescending the index) is implemented by calling _bt_first
+ * after the array keys are "globally advanced" by the top-level index scan.
+ *
+ * Starting a new primitive index scan is avoided whenever the end of matches
+ * for the current set of array keys happens to be physically close to the
+ * start of matches for the next set of array keys. The technique isn't used
+ * when matches for the next set of array keys aren't found on the same leaf
+ * page (unless there is good reason to believe that a visit to the next leaf
+ * page needs to take place).
+ *
+ * In the worst case the top-level index scan performs one primitive index
+ * scan per distinct set of array/search keys. In the best case we require
+ * only a single primitive index scan for the entire top-level index scan
+ * (this is even possible with arbitrarily-many distinct sets of array keys).
+ * The optimization is particularly effective with queries that have several
+ * SK_SEARCHARRAY keys (one per index column) when scanning a composite index.
+ * Most individual search key combinations (which are simple conjunctions) may
+ * well turn out to have no matching index tuples.
+ *
+ * Returns false when array keys have not or cannot advance. A new primitive
+ * index scan will be required -- except when the top-level, btrescan-wise
+ * index scan has processed all array keys in the current scan direction.
+ *
+ * Returns true when array keys were advanced "locally". Caller must recheck
+ * the tuple that initially set continuescan=false against the new array keys.
+ * At this point the newly advanced array keys are provisional. The "current"
+ * keys only get "locked in" to the ongoing primitive scan when _bt_checkkeys
+ * returns its first match for the keys. This must happen almost immediately;
+ * we should only invest in eliding primitive index scans when we're almost
+ * certain that it'll work out.
+ *
+ * Note: The fact that we only advance array keys "provisionally" imposes a
+ * requirement on _bt_readpage: it must call _bt_checkfinalkeys whenever its
+ * scan of a leaf page wasn't terminated when it called _bt_checkkeys against
+ * non-pivot tuples. This scheme ensures that we'll always have at least one
+ * opportunity to change our minds per leaf page scanned (even, say, on a page
+ * that only contains non-pivot tuples whose LP_DEAD bits are set).
+ *
+ * Note: We can determine that the next leaf page ought to be handled by the
+ * ongoing primitive index scan without being fully sure that it'll work out.
+ * This occasionally results in primitive index scans that waste cycles on a
+ * useless visit to an extra page, which then terminates the primitive scan.
+ * Such wasted accesses are only possible when the high key (or the final key
+ * in the case of backwards scans) is within the bounds of the latest set of
+ * array keys that the primitive scan can advance to.
+ *
+ * Note: There are cases where we visit the next leaf page during a primitive
+ * index scan without being completely certain about whether or not we really
+ * need to visit that page at all. In other words, sometimes we speculatively
+ * visit the next leaf page, which risks wasting a leaf page access.
+ */
+static bool
+_bt_advance_array_keys_locally(IndexScanDesc scan, IndexTuple tuple,
+ bool final, BTReadPageState *pstate)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ Assert(!pstate->continuescan);
+ Assert(so->arrayKeysStarted);
+
+ if (!so->arrayPoskey)
+ {
+ /*
+ * Scans that lack an initial positioning key (and so must go through
+ * _bt_endpoint rather than calling _bt_search from _bt_first) are not
+ * capable of locally advancing array keys
+ */
+ return false;
+ }
+
+ /*
+ * Current search type scan keys (including current array keys) indicated
+ * that this tuple terminates the scan in _bt_checkkeys caller. Can this
+ * tuple be a match for later sets of array keys, once advanced?
+ */
+ if (!_bt_tuple_advances_keys(scan, tuple, pstate->dir))
+ {
+ /*
+ * Tuple definitely isn't a match for any set of search keys. Tuple
+ * definitely won't be returned by _bt_checkkeys. Now we need to
+ * determine if the scan will continue to the next tuple/page.
+ *
+ * If this is a forwards scan, check the high key -- page state
+ * stashes it in order to allow us to terminate processing of a page
+ * (and the primitive index scan as a whole) early.
+ *
+ * If this is a backwards scan, treat the first non-pivot tuple as a
+ * stand-in for the page high key. Unlike the forward scan case, this
+ * is only possible when _bt_checkkeys reaches the final tuple on the
+ * page. (Only the more common forward scan case has the ability to
+ * end the scan of an individual page early using the high key because
+ * we always have the high key stashed.)
+ *
+ * This always needs to happen before we leave each leaf page, for all
+ * sets of array keys up to and including the last set we advance to.
+ * We must avoid becoming confused about which primitive index scan
+ * (the current or the next) returns matches for any set of array
+ * keys.
+ */
+ if (!pstate->match_for_cur_array_keys &&
+ (final || (!pstate->highkeychecked && pstate->highkey)))
+ {
+ Assert(ScanDirectionIsForward(pstate->dir) || !pstate->highkey);
+ Assert(ScanDirectionIsBackward(pstate->dir) || !final);
+
+ pstate->highkeychecked = true; /* iff this is a forward scan */
+
+ if (final || !_bt_tuple_advances_keys(scan, pstate->highkey,
+ pstate->dir))
+ {
+ /*
+ * We're unlikely to find any further matches for the current
+ * set of array keys on the next sibling leaf page.
+ *
+ * Back up the array keys so that btgettuple or btgetbitmap
+ * won't advance the keys past the now-current set. This is
+ * safe because we haven't returned any tuples matching this
+ * set of keys.
+ */
+ ScanDirection flipdir = -pstate->dir;
+
+ if (!_bt_advance_array_keys(scan, flipdir))
+ Assert(false);
+
+ _bt_preprocess_keys(scan);
+
+ /* End the current primitive index scan */
+ pstate->continuescan = false; /* redundant */
+ return false;
+ }
+ }
+
+ /*
+ * Continue the current primitive index scan. Returning false
+ * indicates that we're done with this tuple. The ongoing primitive
+ * index scan will proceed to the next non-pivot tuple on this page
+ * (or to the first non-pivot tuple on the next page).
+ */
+ pstate->continuescan = true;
+ return false;
+ }
+
+ if (!_bt_advance_array_keys(scan, pstate->dir))
+ {
+ Assert(!so->arrayKeysStarted);
+
+ /*
+ * Ran out of array keys to advance the scan to. The top-level,
+ * btrescan-wise scan has been terminated by this tuple.
+ */
+ pstate->continuescan = false; /* redundant */
+ return false;
+ }
+
+ /*
+ * Successfully advanced the array keys. We'll now need to see what
+ * _bt_checkkeys loop says about the same tuple with this new set of keys.
+ *
+ * Advancing the arrays keys is only provisional at this point. If there
+ * are no matches for the new array keys before we leave the page, and
+ * high key check indicates that there is little chance of finding any
+ * matches for the new keys on the next page, we will change our mind.
+ * This is handled by "backing up" the array keys, and then starting a new
+ * primitive index scan for the same set of array keys.
+ *
+ * XXX Clearly it would be a lot more efficient if we were to implement
+ * all this by searching for the next set of array keys using this tuple's
+ * key values, directly. Right now we effectively use a linear search
+ * (though one that can terminate upon finding the first match). We must
+ * make it into a binary search to get acceptable performance.
+ *
+ * Our current naive approach works well enough for prototyping purposes,
+ * but chokes in extreme cases where the Cartesian product of all SAOP
+ * arrays (i.e. the total number of DNF single value predicates generated
+ * by the _bt_advance_array_keys state machine) starts to get unwieldy.
+ * We're holding a buffer lock here, so this isn't really negotiable.
+ *
+ * It's not particular unlikely that the total number of DNF predicates
+ * exceeds the number of tuples that'll be returned by the ongoing scan.
+ * Efficiently advancing the array keys might turn out to matter almost as
+ * much as efficiently searching for the next matching index tuple.
+ */
+ _bt_preprocess_keys(scan);
+
+ if (pstate->highkey)
+ {
+ /* High key precheck might need to be repeated for new array keys */
+ pstate->match_for_cur_array_keys = false;
+ pstate->highkeychecked = false;
+ }
+
+ /*
+ * Note: It doesn't matter how continuescan is set by us at this point.
+ * The next iteration of caller's loop will overwrite continuescan.
+ */
+ return true;
+}
+
+/*
+ * Helper routine used by _bt_advance_array_keys_locally.
+ *
+ * We're called with tuples that _bt_checkkeys set continuescan to false for.
+ * We distinguish between search-type scan keys that have equality constraints
+ * on an index column (which are always marked as required in both directions)
+ * and other search-type scan keys that are required in one direction only.
+ * The distinction is important independent of the current scan direction,
+ * since caller should only advance array keys when an equality constraint
+ * indicated the end of the current set of array keys. (Note also that
+ * non-equality "required in one direction only" scan keys can only end the
+ * entire btrescan-wise scan when we run out of array keys to process for the
+ * current scan direction).
+ *
+ * We help our caller identify where matches for the next set of array keys
+ * _might_ begin when it turns out that we can elide another descent of the
+ * index for the next set of array keys. There will be a gap of 0 or more
+ * non-matching index tuples between the last tuple that satisfies the current
+ * set of scan keys (including its array keys), and the first tuple that might
+ * satisfy the next set (caller won't know for sure until after it advances
+ * the current set of array keys). This gap might be negligible, or it might
+ * be a significant fraction of all non-pivot tuples on the leaf level.
+ *
+ * The qual "WHERE x IN (3,4,5) AND y < 42" will have its 'y' scan key marked
+ * SK_BT_REQFWD (not SK_BT_REQBKWD) -- 'y' isn't an equality constraint.
+ * _bt_checkkeys will set continuescan=false as soon as the scan reaches a
+ * tuple matching (3, 42) or a tuple matching (4, 1). Eliding the next
+ * primitive index scan (by advancing the array keys locally) happens when the
+ * gap is confined to a single leaf page. Caller continues its scan through
+ * these gap tuples, and calls back here to check if it has found the point
+ * that it might be necessary to advance its array keys.
+ *
+ * Returns false when caller's tuple definitely isn't where the next group of
+ * matching tuples begins. Caller can either continue the process with the
+ * very next tuple from its leaf page, or give up completely. Giving up means
+ * that caller accepts that there must be another _bt_first descent (in the
+ * likely event of another call to btgettuple/btgetbitmap from the executor).
+ *
+ * Returns true when caller passed a tuple that might be a match for the next
+ * set of array keys. That is, when tuple is > the current set of array keys
+ * and other equality constraints for a forward scan (or < for a backwards
+ * scans). Caller must attempt to advance the array keys when this happens.
+ *
+ * Note: Our test is based on the current equality constraint scan keys rather
+ * than the next set in line because it's not yet clear if the next set in
+ * line will find any matches whatsoever. Once caller is positioned at the
+ * first tuple that might satisfy the next set of array keys, it could be
+ * necessary for it to advance its array keys more than once.
+ */
+static bool
+_bt_tuple_advances_keys(IndexScanDesc scan, IndexTuple tuple, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ bool tuple_ahead = true;
+ int ncmpkey;
+
+ Assert(so->qual_ok);
+ Assert(so->numArrayKeys > 0);
+ Assert(so->numberOfKeys > 0);
+ Assert(so->arrayPoskey->keysz > 0);
+
+ ncmpkey = Min(BTreeTupleGetNAtts(tuple, rel), so->numberOfKeys);
+ for (int attnum = 1; attnum <= ncmpkey; attnum++)
+ {
+ ScanKey cur = &so->keyData[attnum - 1];
+ ScanKey iscankey;
+ Datum datum;
+ bool isNull;
+ int32 result;
+
+ if ((ScanDirectionIsForward(dir) &&
+ (cur->sk_flags & SK_BT_REQFWD) == 0) ||
+ (ScanDirectionIsBackward(dir) &&
+ (cur->sk_flags & SK_BT_REQBKWD) == 0))
+ {
+ /*
+ * This scan key is not marked as required for the current
+ * direction, so there are no further attributes to consider. This
+ * tuple definitely isn't at the start of the next group of
+ * matching tuples.
+ */
+ break;
+ }
+
+ Assert(cur->sk_attno == attnum);
+ if (cur->sk_attno > so->arrayPoskey->keysz)
+ {
+ /*
+ * There is no equality constraint on this column/scan key to
+ * break the tie. This tuple definitely isn't at the start of the
+ * next group of matching tuples.
+ */
+ Assert(cur->sk_strategy != BTEqualStrategyNumber);
+ Assert((cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) !=
+ (SK_BT_REQFWD | SK_BT_REQBKWD));
+ break;
+ }
+
+ /*
+ * This column has an equality constraint/insertion scan key entry
+ */
+ Assert((cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) ==
+ (SK_BT_REQFWD | SK_BT_REQBKWD));
+ Assert(cur->sk_strategy == BTEqualStrategyNumber);
+
+ /*
+ * Row comparison scan keys may be present after (though never before)
+ * columns that we recognized as having equality constraints.
+ *
+ * A qual like "WHERE a in (1, 2, 3) AND (b, c) >= (500, 7)" is safe,
+ * whereas "WHERE (a, b) >= (1, 500) AND c in (7, 8, 9)" is unsafe.
+ * Assert that this isn't one of the unsafe cases in passing.
+ */
+ Assert((cur->sk_flags & SK_ROW_HEADER) == 0);
+
+ /*
+ * We'll need to use this attribute's 3-way comparison order proc
+ * (btree opclass support function 1) from its insertion-type scan key
+ */
+ iscankey = &so->arrayPoskey->scankeys[attnum - 1];
+ Assert(iscankey->sk_flags == cur->sk_flags);
+ Assert(iscankey->sk_attno == cur->sk_attno);
+ Assert(iscankey->sk_subtype == cur->sk_subtype);
+ Assert(iscankey->sk_collation == cur->sk_collation);
+
+ /*
+ * The 3-way comparison order proc will be called using the
+ * search-type scan key's current sk_argument
+ */
+ datum = index_getattr(tuple, attnum, itupdesc, &isNull);
+ if (iscankey->sk_flags & SK_ISNULL) /* key is NULL */
+ {
+ if (isNull)
+ result = 0; /* NULL "=" NULL */
+ else if (iscankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull) /* key is NOT_NULL and item is NULL */
+ {
+ if (iscankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). We want to keep this consistent with what _bt_compare
+ * does, so we flip the sign of the comparison result. (Unless
+ * it's a DESC column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&iscankey->sk_func,
+ cur->sk_collation, datum,
+ cur->sk_argument));
+ if (!(iscankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ if (result != 0)
+ {
+ if (ScanDirectionIsForward(dir))
+ tuple_ahead = result < 0;
+ else
+ tuple_ahead = result > 0;
+
+ break;
+ }
+ }
+
+ return tuple_ahead;
+}
+
/*
* _bt_mark_array_keys() -- Handle array keys during btmarkpos
*
@@ -744,6 +1174,12 @@ _bt_restore_array_keys(IndexScanDesc scan)
* storage is that we are modifying the array based on comparisons of the
* key argument values, which could change on a rescan or after moving to
* new elements of array keys. Therefore we can't overwrite the source data.
+ *
+ * TODO Replace all calls to this function added by the patch with calls to
+ * some other more specialized function with reduced surface area -- something
+ * that is explicitly safe to call while holding a buffer lock. That's been
+ * put off for now because the code in this function is likely to need to be
+ * better integrated with the planner before long anyway.
*/
void
_bt_preprocess_keys(IndexScanDesc scan)
@@ -1012,6 +1448,45 @@ _bt_preprocess_keys(IndexScanDesc scan)
so->numberOfKeys = new_numberOfKeys;
}
+/*
+ * Save insertion scankey for searches with a SK_SEARCHARRAY scan key.
+ *
+ * We must save the initial positioning insertion scan key for SK_SEARCHARRAY
+ * scans (barring those that only have SK_SEARCHARRAY inequalities). Each
+ * insertion scan key entry/column will have a corresponding "=" operator in
+ * caller's search-type scan key, but that's no substitute for the 3-way
+ * comparison function.
+ *
+ * _bt_tuple_advances_keys needs to perform 3-way comparisons to figure out if
+ * an ongoing scan can elide another descent of the index in _bt_first. It
+ * works by locating the end of the _current_ set of equality constraint type
+ * scan keys -- not by locating the start of the next set. This is not unlike
+ * the approach taken by _bt_search with a nextkey=true search.
+ */
+void
+_bt_array_keys_save_scankeys(IndexScanDesc scan, BTScanInsert inskey)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Size sksize;
+
+ Assert(inskey->keysz > 0);
+ Assert(so->numArrayKeys > 0);
+ Assert(so->qual_ok);
+ Assert(!BTScanPosIsValid(so->currPos));
+
+ if (so->arrayPoskey)
+ {
+ /* Reuse the insertion scan key from the last primitive index scan */
+ Assert(so->arrayPoskey->keysz == inskey->keysz);
+ return;
+ }
+
+ sksize = offsetof(BTScanInsertData, scankeys) +
+ sizeof(ScanKeyData) * inskey->keysz;
+ so->arrayPoskey = palloc(sksize);
+ memcpy(so->arrayPoskey, inskey, sksize);
+}
+
/*
* Compare two scankey values using a specified operator.
*
@@ -1348,35 +1823,68 @@ _bt_mark_scankey_required(ScanKey skey)
* this tuple, and set *continuescan accordingly. See comments for
* _bt_preprocess_keys(), above, about how this is done.
*
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
+ * Advances the current set of array keys locally for SK_SEARCHARRAY scans
+ * where appropriate. These callers are required to initialize the page level
+ * high key in pstate.
*
* scan: index scan descriptor (containing a search-type scankey)
* tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
+ * final: final tuple/call for this page, from a backwards scan?
+ * pstate: Page level input and output parameters
*/
bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
+_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, bool final,
+ BTReadPageState *pstate)
+{
+ TupleDesc tupdesc = RelationGetDescr(scan->indexRelation);
+ int natts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool res;
+
+ /* This loop handles advancing to the next array elements, if any */
+ do
+ {
+ res = _bt_check_compare(so->keyData, so->numberOfKeys,
+ tuple, natts, tupdesc,
+ pstate->dir, &pstate->continuescan);
+
+ /* If we have a tuple, return it ... */
+ if (res)
+ {
+ pstate->match_for_cur_array_keys = true;
+
+ Assert(!so->numArrayKeys || !so->arrayPoskey ||
+ _bt_tuple_advances_keys(scan, tuple, pstate->dir));
+ break;
+ }
+
+ /* ... otherwise see if we have more array keys to deal with */
+ } while (so->numArrayKeys && !pstate->continuescan &&
+ _bt_advance_array_keys_locally(scan, tuple, final, pstate));
+
+ return res;
+}
+
+/*
+ * Test whether an indextuple satisfies current scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction to
+ * pass the qual with the current set of array keys.
+ *
+ * This is a subroutine for _bt_checkkeys.
+ */
+static bool
+_bt_check_compare(ScanKey keyData, int keysz,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ ScanDirection dir, bool *continuescan)
{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
int ikey;
ScanKey key;
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
*continuescan = true; /* default assumption */
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ for (key = keyData, ikey = 0; ikey < keysz; key++, ikey++)
{
Datum datum;
bool isNull;
@@ -1523,7 +2031,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* it's not possible for any future tuples in the current scan direction
* to pass the qual.
*
- * This is a subroutine for _bt_checkkeys, which see for more info.
+ * This is a subroutine for _bt_check_compare/_bt_checkkeys_compare.
*/
static bool
_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
@@ -1690,6 +2198,49 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
return result;
}
+void
+_bt_checkfinalkeys(IndexScanDesc scan, BTReadPageState *pstate)
+{
+ IndexTuple highkey = pstate->highkey;
+
+ Assert(pstate->continuescan);
+
+ if (!pstate->highkey)
+ {
+ _bt_nocheckkeys(scan, pstate->dir);
+ pstate->continuescan = false;
+ return;
+ }
+
+ pstate->highkey = NULL;
+ _bt_checkkeys(scan, highkey, false, pstate);
+}
+
+/*
+ * Perform final steps when the "end point" is reached on the leaf level
+ * without any call to _bt_checkkeys setting *continuescan to false.
+ *
+ * Called on the rightmost page in the forward scan case, and the leftmost
+ * page in the backwards scan case. Only call here when _bt_checkkeys hasn't
+ * already set continuescan to false.
+ */
+bool
+_bt_nocheckkeys(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ /* Only need to do real work in SK_SEARCHARRAY case, for now */
+ if (!so->numArrayKeys)
+ return false;
+
+ Assert(so->arrayKeysStarted);
+
+ while (_bt_advance_array_keys(scan, dir))
+ _bt_preprocess_keys(scan);
+
+ return true;
+}
+
/*
* _bt_killitems - set LP_DEAD state for items an indexscan caller has
* told us were killed
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 6a93d767a..73064758d 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -32,6 +32,7 @@
#include "optimizer/paths.h"
#include "optimizer/prep.h"
#include "optimizer/restrictinfo.h"
+#include "utils/fmgroids.h"
#include "utils/lsyscache.h"
#include "utils/selfuncs.h"
@@ -107,7 +108,7 @@ static List *build_index_paths(PlannerInfo *root, RelOptInfo *rel,
bool useful_predicate,
ScanTypeControl scantype,
bool *skip_nonnative_saop,
- bool *skip_lower_saop);
+ bool *skip_unordered_saop);
static List *build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel,
List *clauses, List *other_clauses);
static List *generate_bitmap_or_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -706,8 +707,8 @@ eclass_already_used(EquivalenceClass *parent_ec, Relids oldrelids,
* index AM supports them natively, we should just include them in simple
* index paths. If not, we should exclude them while building simple index
* paths, and then make a separate attempt to include them in bitmap paths.
- * Furthermore, we should consider excluding lower-order ScalarArrayOpExpr
- * quals so as to create ordered paths.
+ * Furthermore, we should consider excluding ScalarArrayOpExpr quals whose
+ * inclusion would force the path as a whole to be unordered.
*/
static void
get_index_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -716,28 +717,28 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
{
List *indexpaths;
bool skip_nonnative_saop = false;
- bool skip_lower_saop = false;
+ bool skip_unordered_saop = false;
ListCell *lc;
/*
* Build simple index paths using the clauses. Allow ScalarArrayOpExpr
* clauses only if the index AM supports them natively, and skip any such
- * clauses for index columns after the first (so that we produce ordered
- * paths if possible).
+ * clauses for index columns whose inclusion would make it impossible to
+ * produce ordered paths.
*/
indexpaths = build_index_paths(root, rel,
index, clauses,
index->predOK,
ST_ANYSCAN,
&skip_nonnative_saop,
- &skip_lower_saop);
+ &skip_unordered_saop);
/*
- * If we skipped any lower-order ScalarArrayOpExprs on an index with an AM
- * that supports them, then try again including those clauses. This will
- * produce paths with more selectivity but no ordering.
+ * If we skipped any ScalarArrayOpExprs without ordered paths on an index
+ * with an AM that supports them, then try again including those clauses.
+ * This will produce paths with more selectivity.
*/
- if (skip_lower_saop)
+ if (skip_unordered_saop)
{
indexpaths = list_concat(indexpaths,
build_index_paths(root, rel,
@@ -817,11 +818,9 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
* to true if we found any such clauses (caller must initialize the variable
* to false). If it's NULL, we do not ignore ScalarArrayOpExpr clauses.
*
- * If skip_lower_saop is non-NULL, we ignore ScalarArrayOpExpr clauses for
- * non-first index columns, and we set *skip_lower_saop to true if we found
- * any such clauses (caller must initialize the variable to false). If it's
- * NULL, we do not ignore non-first ScalarArrayOpExpr clauses, but they will
- * result in considering the scan's output to be unordered.
+ * If skip_unordered_saop is non-NULL, we ignore ScalarArrayOpExpr clauses
+ * whose inclusion forces us to treat the scan's output as unordered. If it's
+ * NULL then we allow it, in order to produce paths with greater selectivity.
*
* 'rel' is the index's heap relation
* 'index' is the index for which we want to generate paths
@@ -829,7 +828,7 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
* 'useful_predicate' indicates whether the index has a useful predicate
* 'scantype' indicates whether we need plain or bitmap scan support
* 'skip_nonnative_saop' indicates whether to accept SAOP if index AM doesn't
- * 'skip_lower_saop' indicates whether to accept non-first-column SAOP
+ * 'skip_unordered_saop' indicates whether to accept unordered SOAPs
*/
static List *
build_index_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -837,7 +836,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
bool useful_predicate,
ScanTypeControl scantype,
bool *skip_nonnative_saop,
- bool *skip_lower_saop)
+ bool *skip_unordered_saop)
{
List *result = NIL;
IndexPath *ipath;
@@ -848,10 +847,13 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
List *orderbyclausecols;
List *index_pathkeys;
List *useful_pathkeys;
- bool found_lower_saop_clause;
+ bool row_compare_seen_already;
+ bool saop_included_already;
+ bool saop_invalidates_ordering;
bool pathkeys_possibly_useful;
bool index_is_ordered;
bool index_only_scan;
+ int prev_equality_indexcol;
int indexcol;
/*
@@ -880,25 +882,27 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
* on by btree and possibly other places.) The list can be empty, if the
* index AM allows that.
*
- * found_lower_saop_clause is set true if we accept a ScalarArrayOpExpr
- * index clause for a non-first index column. This prevents us from
- * assuming that the scan result is ordered. (Actually, the result is
- * still ordered if there are equality constraints for all earlier
- * columns, but it seems too expensive and non-modular for this code to be
- * aware of that refinement.)
+ * saop_invalidates_ordering is set true if we accept a ScalarArrayOpExpr
+ * index clause that invalidates the sort order. In practice this is
+ * always due to the presence of a non-first index column. This prevents
+ * us from assuming that the scan result is ordered.
*
* We also build a Relids set showing which outer rels are required by the
* selected clauses. Any lateral_relids are included in that, but not
* otherwise accounted for.
*/
index_clauses = NIL;
- found_lower_saop_clause = false;
+ prev_equality_indexcol = -1;
+ row_compare_seen_already = false;
+ saop_included_already = false;
+ saop_invalidates_ordering = false;
outer_relids = bms_copy(rel->lateral_relids);
for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++)
{
+ List *colclauses = clauses->indexclauses[indexcol];
ListCell *lc;
- foreach(lc, clauses->indexclauses[indexcol])
+ foreach(lc, colclauses)
{
IndexClause *iclause = (IndexClause *) lfirst(lc);
RestrictInfo *rinfo = iclause->rinfo;
@@ -906,6 +910,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
/* We might need to omit ScalarArrayOpExpr clauses */
if (IsA(rinfo->clause, ScalarArrayOpExpr))
{
+ ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause;
+
if (!index->amsearcharray)
{
if (skip_nonnative_saop)
@@ -916,18 +922,152 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
}
/* Caller had better intend this only for bitmap scan */
Assert(scantype == ST_BITMAPSCAN);
+ saop_invalidates_ordering = true; /* defensive */
+ goto include_clause;
}
- if (indexcol > 0)
+
+ /*
+ * Index AM that handles ScalarArrayOpExpr quals natively.
+ *
+ * We assume that it's always better to apply a clause as an
+ * indexqual than as a filter (qpqual); which is where an
+ * available clause would end up being applied if we omit it
+ * from the indexquals.
+ *
+ * XXX Currently, nbtree just assumes that all SK_SEARCHARRAY
+ * search-type scankeys will be marked as required, with the
+ * exception of the first attribute without an "=" key (any
+ * such attribute is marked SK_BT_REQFWD or SK_BT_REQBKWD, but
+ * it won't be in the initial positioning insertion scan key,
+ * so _bt_array_continuescan() won't consider it).
+ */
+ if (row_compare_seen_already)
{
- if (skip_lower_saop)
+ /*
+ * Cannot safely include a ScalarArrayOpExpr after a
+ * higher-order RowCompareExpr (barring the "=" case).
+ */
+ Assert(indexcol > 0);
+ continue;
+ }
+
+ /*
+ * Make a blanket assumption that any index column with more
+ * than a single clause cannot include ScalarArrayOpExpr
+ * clauses >= that column. Quals like "WHERE my_col in (1,2)
+ * AND my_col < 1" are unsafe without this.
+ *
+ * XXX This is overkill.
+ */
+ if (list_length(colclauses) > 1)
+ continue;
+
+ if (indexcol != prev_equality_indexcol + 1)
+ {
+ /*
+ * An index attribute that lacks an equality constraint
+ * was included as a clause already. This may make it
+ * unsafe to include this ScalarArrayOpExpr clause now.
+ */
+ if (saop_included_already)
{
- /* Caller doesn't want to lose index ordering */
- *skip_lower_saop = true;
+ /*
+ * We included at least one ScalarArrayOpExpr clause
+ * earlier, too. (This must have been included before
+ * the inequality, since we treat ScalarArrayOpExpr
+ * clauses as equality constraints by default.)
+ *
+ * We cannot safely include this ScalarArrayOpExpr as
+ * a clause for the current index path. It'll become
+ * qpqual conditions instead.
+ */
continue;
}
- found_lower_saop_clause = true;
+
+ /*
+ * This particular ScalarArrayOpExpr happens to be the
+ * most significant one encountered so far. That makes it
+ * safe to include, despite gaps in constraints on prior
+ * index columns -- provided we invalidate ordering for
+ * the index path as a whole.
+ */
+ if (skip_unordered_saop)
+ {
+ /* Caller doesn't want to lose index ordering */
+ *skip_unordered_saop = true;
+ continue;
+ }
+
+ /* Caller prioritizes selectivity over ordering */
+ saop_invalidates_ordering = true;
}
+
+ /*
+ * Includable ScalarArrayOpExpr clauses are themselves
+ * equality constraints (they don't make the inclusion of
+ * further ScalarArrayOpExpr clauses invalidate ordering).
+ *
+ * XXX excludes inequality-type SAOPs using get_oprrest, which
+ * seems particularly kludgey.
+ */
+ saop_included_already = true;
+ if (saop->useOr && get_oprrest(saop->opno) == F_EQSEL)
+ prev_equality_indexcol = indexcol;
}
+ else if (IsA(rinfo->clause, NullTest))
+ {
+ NullTest *nulltest = (NullTest *) rinfo->clause;
+
+ /*
+ * Like ScalarArrayOpExpr clauses, IS NULL NullTest clauses
+ * are treated as equality conditions, despite not being
+ * recognized as such by the equivalence class machinery.
+ *
+ * This relies on the assumption that amsearcharray index AMs
+ * will treat NULL as just another value from the domain of
+ * indexed values for initial search purposes.
+ */
+ if (!nulltest->argisrow && nulltest->nulltesttype == IS_NULL)
+ prev_equality_indexcol = indexcol;
+ }
+ else if (IsA(rinfo->clause, RowCompareExpr))
+ {
+ /*
+ * RowCompareExpr clause will make it unsafe to include any
+ * ScalarArrayOpExpr encountered in lower-order clauses.
+ * (Already-included ScalarArrayOpExpr clauses can stay.)
+ */
+ row_compare_seen_already = true;
+ }
+ else if (rinfo->mergeopfamilies)
+ {
+ /*
+ * Equality constraint clause -- won't make it unsafe to
+ * include later ScalarArrayOpExpr clauses
+ */
+ prev_equality_indexcol = indexcol;
+ }
+ else
+ {
+ /*
+ * Clause isn't an equality condition according to the EQ
+ * machinery (not a NullTest or ScalarArrayOpExpr, either).
+ *
+ * If there are any later ScalarArrayOpExpr clauses, they must
+ * not be used as index quals. We'll either make it safe by
+ * setting saop_invalidates_ordering to true, or by just not
+ * including them (they can still be qpqual conditions).
+ *
+ * Note: there are several interesting types of expressions
+ * that we deem incompatible with ScalarArrayOpExpr clauses
+ * due to a lack of infrastructure to perform transformations
+ * of predicates from CNF (conjunctive normal form) to DNF
+ * (disjunctive normal form). The MDAM paper describes many
+ * examples of these transformations.
+ */
+ }
+
+ include_clause:
/* OK to include this clause */
index_clauses = lappend(index_clauses, iclause);
@@ -960,7 +1100,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
* assume the scan is unordered.
*/
pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN &&
- !found_lower_saop_clause &&
+ !saop_invalidates_ordering &&
has_useful_pathkeys(root, rel));
index_is_ordered = (index->sortopfamily != NULL);
if (index_is_ordered && pathkeys_possibly_useful)
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076..51de102b0 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6700,9 +6700,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* For a RowCompareExpr, we consider only the first column, just as
* rowcomparesel() does.
*
- * If there's a ScalarArrayOpExpr in the quals, we'll actually perform N
- * index scans not one, but the ScalarArrayOpExpr's operator can be
- * considered to act the same as it normally does.
+ * If there's a ScalarArrayOpExpr in the quals, we'll perform N primitive
+ * index scans in the worst case. Assume that worst case, for now. We'll
+ * clamp later on if the tally approaches the total number of index pages.
*/
indexBoundQuals = NIL;
indexcol = 0;
@@ -6754,7 +6754,15 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
clause_op = saop->opno;
found_saop = true;
- /* count number of SA scans induced by indexBoundQuals only */
+
+ /*
+ * Count number of SA scans induced by indexBoundQuals only.
+ *
+ * Since this is multiplicative, it can wildly inflate the
+ * assumed number of descents (number of primitive index
+ * scans) for scans with several SAOP clauses. We might clamp
+ * num_sa_scans later on to deal with this.
+ */
if (alength > 1)
num_sa_scans *= alength;
}
@@ -6832,6 +6840,39 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
genericcostestimate(root, path, loop_count, &costs);
+ /*
+ * The btree index AM will automatically combine individual primitive
+ * index scans whenever the tuples covered by the next set of array keys
+ * are close to tuples covered by the current set. This optimization
+ * makes the final number of descents particularly difficult to estimate.
+ * However, btree scans never visit any single leaf page more than once.
+ * That puts a natural floor under the worst case number of descents.
+ *
+ * Clamp the number of descents to the estimated number of leaf page
+ * visits. This is still fairly pessimistic, but tends to result in more
+ * accurate costing of scans with several SAOP clauses -- especially when
+ * each array has more than a few elements.
+ *
+ * Also clamp the number of descents to 1/3 the number of index pages.
+ * This avoids implausibly high estimates with low selectivity paths,
+ * where scans frequently require no more than one or two descents.
+ *
+ * XXX genericcostestimate is still the dominant influence on the total
+ * cost of SAOP-heavy index paths -- indexTotalCost is still calculated in
+ * a way that assumes significant repeat access to leaf pages for a path
+ * with SAOP clauses. This just isn't sensible anymore. Note that nbtree
+ * scans promise to avoid accessing any leaf page more than once. The
+ * worst case I/O cost of an SAOP-heavy path is therefore guaranteed to
+ * never exceed the I/O cost of a conventional full index scan (though
+ * this relies on standard assumptions about internal page access costs).
+ */
+ if (num_sa_scans > 1)
+ {
+ num_sa_scans = Min(num_sa_scans, costs.numIndexPages);
+ num_sa_scans = Min(num_sa_scans, index->pages / 3);
+ num_sa_scans = Max(num_sa_scans, 1);
+ }
+
/*
* Add a CPU-cost component to represent the costs of initial btree
* descent. We don't charge any I/O cost for touching upper btree levels,
@@ -6847,7 +6888,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
{
descentCost = ceil(log(index->tuples) / log(2.0)) * cpu_operator_cost;
costs.indexStartupCost += descentCost;
- costs.indexTotalCost += costs.num_sa_scans * descentCost;
+ costs.indexTotalCost += num_sa_scans * descentCost;
}
/*
@@ -6858,11 +6899,12 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* in cases where only a single leaf page is expected to be visited. This
* cost is somewhat arbitrarily set at 50x cpu_operator_cost per page
* touched. The number of such pages is btree tree height plus one (ie,
- * we charge for the leaf page too). As above, charge once per SA scan.
+ * we charge for the leaf page too). As above, charge once per estimated
+ * primitive SA scan.
*/
descentCost = (index->tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER * cpu_operator_cost;
costs.indexStartupCost += descentCost;
- costs.indexTotalCost += costs.num_sa_scans * descentCost;
+ costs.indexTotalCost += num_sa_scans * descentCost;
/*
* If we can get an estimate of the first column's ordering correlation C
--
2.40.1
On Tue, 25 Jul 2023 at 03:34, Peter Geoghegan <pg@bowt.ie> wrote:
I've been working on a variety of improvements to nbtree's native
ScalarArrayOpExpr execution. This builds on Tom's work in commit
9e8da0f7.
Cool.
Attached patch is still at the prototype stage. I'm posting it as v1 a
little earlier than I usually would because there has been much back
and forth about it on a couple of other threads involving Tomas Vondra
and Jeff Davis -- seems like it would be easier to discuss with
working code available.The patch adds two closely related enhancements to ScalarArrayOp
execution by nbtree:1. Execution of quals with ScalarArrayOpExpr clauses during nbtree
index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now
"advance the scan's array keys locally", which sometimes avoids
significant amounts of unneeded pinning/locking of the same set of
index pages.SAOP index scans become capable of eliding primitive index scans for
the next set of array keys in line in cases where it isn't truly
necessary to descend the B-Tree again. Index scans are now capable of
"sticking with the existing leaf page for now" when it is determined
that the end of the current set of array keys is physically close to
the start of the next set of array keys (the next set in line to be
materialized by the _bt_advance_array_keys state machine). This is
often possible.Naturally, we still prefer to advance the array keys in the
traditional way ("globally") much of the time. That means we'll
perform another _bt_first/_bt_search descent of the index, starting a
new primitive index scan. Whether we try to skip pages on the leaf
level or stick with the current primitive index scan (by advancing
array keys locally) is likely to vary a great deal. Even during the
same index scan. Everything is decided dynamically, which is the only
approach that really makes sense.This optimization can significantly lower the number of buffers pinned
and locked in cases with significant locality, and/or with many array
keys with no matches. The savings (when measured in buffers
pined/locked) can be as high as 10x, 100x, or even more. Benchmarking
has shown that transaction throughput for variants of "pgbench -S"
designed to stress the implementation (hundreds of array constants)
under concurrent load can have up to 5.5x higher transaction
throughput with the patch. Less extreme cases (10 array constants,
spaced apart) see about a 20% improvement in throughput. There are
similar improvements to latency for the patch, in each case.
Considering that it caches/reuses the page across SAOP operations, can
(or does) this also improve performance for index scans on the outer
side of a join if the order of join columns matches the order of the
index?
That is, I believe this caches (leaf) pages across scan keys, but can
(or does) it also reuse these already-cached leaf pages across
restarts of the index scan/across multiple index lookups in the same
plan node, so that retrieval of nearby index values does not need to
do an index traversal?
[...]
Skip Scan
=========MDAM encompasses something that people tend to call "skip scan" --
terminology with a great deal of baggage. These days I prefer to call
it "filling in missing key predicates", per the paper. That's much
more descriptive, and makes it less likely that people will conflate
the techniques with InnoDB style "loose Index scans" -- the latter is
a much more specialized/targeted optimization. (I now believe that
these are very different things, though I was thrown off by the
superficial similarities for a long time. It's pretty confusing.)
I'm not sure I understand. MDAM seems to work on an index level to
return full ranges of values, while "skip scan" seems to try to allow
systems to signal to the index to skip to some other index condition
based on arbitrary cutoffs. This would usually be those of which the
information is not stored in the index, such as "SELECT user_id FROM
orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go
though the user_id index and skip to the next user_id value when it
gets enough rows of a matching result (where "enough" is determined
above the index AM's plan node, or otherwise is impossible to
determine with only the scan key info in the index AM). I'm not sure
how this could work without specifically adding skip scan-related
index AM functionality, and I don't see how it fits in with this
MDAM/SAOP system.
[...]
Thoughts?
MDAM seems to require exponential storage for "scan key operations"
for conditions on N columns (to be precise, the product of the number
of distinct conditions on each column); e.g. an index on mytable
(a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
AND h IN (1, 2)" would require 2^8 entries. If 4 conditions were used
for each column, that'd be 4^8, etc...
With an index column limit of 32, that's quite a lot of memory
potentially needed to execute the statement.
So, this begs the question: does this patch have the same issue? Does
it fail with OOM, does it gracefully fall back to the old behaviour
when the clauses are too complex to linearize/compose/fold into the
btree ordering clauses, or are scan keys dynamically constructed using
just-in-time- or generator patterns?
Kind regards,
Matthias van de Meent
Neon (https://neon.tech/)
On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Considering that it caches/reuses the page across SAOP operations, can
(or does) this also improve performance for index scans on the outer
side of a join if the order of join columns matches the order of the
index?
It doesn't really cache leaf pages at all. What it does is advance the
array keys locally, while the original buffer lock is still held on
that same page.
That is, I believe this caches (leaf) pages across scan keys, but can
(or does) it also reuse these already-cached leaf pages across
restarts of the index scan/across multiple index lookups in the same
plan node, so that retrieval of nearby index values does not need to
do an index traversal?
I'm not sure what you mean. There is no reason why you need to do more
than one single descent of an index to scan many leaf pages using many
distinct sets of array keys. Obviously, this depends on being able to
observe that we really don't need to redescend the index to advance
the array keys, again and again. Note in particularly that this
usually works across leaf pages.
I'm not sure I understand. MDAM seems to work on an index level to
return full ranges of values, while "skip scan" seems to try to allow
systems to signal to the index to skip to some other index condition
based on arbitrary cutoffs. This would usually be those of which the
information is not stored in the index, such as "SELECT user_id FROM
orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go
though the user_id index and skip to the next user_id value when it
gets enough rows of a matching result (where "enough" is determined
above the index AM's plan node, or otherwise is impossible to
determine with only the scan key info in the index AM). I'm not sure
how this could work without specifically adding skip scan-related
index AM functionality, and I don't see how it fits in with this
MDAM/SAOP system.
I think of that as being quite a different thing.
Basically, the patch that added that feature had to revise the index
AM API, in order to support a mode of operation where scans return
groupings rather than tuples. Whereas this patch requires none of
that. It makes affected index scans as similar as possible to
conventional index scans.
[...]
Thoughts?
MDAM seems to require exponential storage for "scan key operations"
for conditions on N columns (to be precise, the product of the number
of distinct conditions on each column); e.g. an index on mytable
(a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
AND h IN (1, 2)" would require 2^8 entries.
Note that I haven't actually changed anything about the way that the
state machine generates new sets of single value predicates -- it's
still just cycling through each distinct set of array keys in the
patch.
What you describe is a problem in theory, but I doubt that it's a
problem in practice. You don't actually have to materialize the
predicates up-front, or at all. Plus you can skip over them using the
next index tuple. So skipping works both ways.
--
Peter Geoghegan
On Wed, 26 Jul 2023 at 15:42, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:Considering that it caches/reuses the page across SAOP operations, can
(or does) this also improve performance for index scans on the outer
side of a join if the order of join columns matches the order of the
index?It doesn't really cache leaf pages at all. What it does is advance the
array keys locally, while the original buffer lock is still held on
that same page.
Hmm, then I had a mistaken understanding of what we do in _bt_readpage
with _bt_saveitem.
That is, I believe this caches (leaf) pages across scan keys, but can
(or does) it also reuse these already-cached leaf pages across
restarts of the index scan/across multiple index lookups in the same
plan node, so that retrieval of nearby index values does not need to
do an index traversal?I'm not sure what you mean. There is no reason why you need to do more
than one single descent of an index to scan many leaf pages using many
distinct sets of array keys. Obviously, this depends on being able to
observe that we really don't need to redescend the index to advance
the array keys, again and again. Note in particularly that this
usually works across leaf pages.
In a NestedLoop(inner=seqscan, outer=indexscan), the index gets
repeatedly scanned from the root, right? It seems that right now, we
copy matching index entries into a local cache (that is deleted on
amrescan), then we drop our locks and pins on the buffer, and then
start returning values from our local cache (in _bt_saveitem).
We could cache the last accessed leaf page across amrescan operations
to reduce the number of index traversals needed when the join key of
the left side is highly (but not necessarily strictly) correllated.
The worst case overhead of this would be 2 _bt_compares (to check if
the value is supposed to be fully located on the cached leaf page)
plus one memcpy( , , BLCKSZ) in the previous loop. With some smart
heuristics (e.g. page fill factor, number of distinct values, and
whether we previously hit this same leaf page in the previous scan of
this Node) we can probably also reduce this overhead to a minimum if
the joined keys are not correllated, but accellerate the query
significantly when we find out they are correllated.
Of course, in the cases where we'd expect very few distinct join keys
the planner would likely put a Memoize node above the index scan, but
for mostly unique join keys I think this could save significant
amounts of time, if only on buffer pinning and locking.
I guess I'll try to code something up when I have the time, as it
sounds not quite exactly related to your patch but an interesting
improvement nonetheless.
Kind regards,
Matthias van de Meent
On Wed, Jul 26, 2023 at 12:07 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
We could cache the last accessed leaf page across amrescan operations
to reduce the number of index traversals needed when the join key of
the left side is highly (but not necessarily strictly) correllated.
That sounds like block nested loop join. It's possible that that could
reuse some infrastructure from this patch, but I'm not sure.
In general, SAOP execution/MDAM performs "duplicate elimination before
it reads the data" by sorting and deduplicating the arrays up front.
While my patch sometimes elides a primitive index scan, primitive
index scans are already disjuncts that are combined to create what can
be considered one big index scan (that's how the planner and executor
think of them). The patch takes that one step further by recognizing
that it could quite literally be one big index scan in some cases (or
fewer, larger scans, at least). It's a natural incremental
improvement, as opposed to inventing a new kind of index scan. If
anything the patch makes SAOP execution more similar to traditional
index scans, especially when costing them.
Like InnoDB style loose index scan (for DISTINCT and GROUP BY
optimization), block nested loop join would require inventing a new
type of index scan. Both of these other two optimizations involve the
use of semantic information that spans multiple levels of abstraction.
Loose scan requires duplicate elimination (that's the whole point),
while IIUC block nested loop join needs to "simulate multiple inner
index scans" by deliberately returning duplicates for each would-be
inner index scan. These are specialized things.
To be clear, I think that all of these ideas are reasonable. I just
find it useful to classify these sorts of techniques according to
whether or not the index AM API would have to change or not, and the
general nature of any required changes. MDAM can do a lot of cool
things without requiring any revisions to the index AM API, which
should allow it to play nice with everything else (index path clause
safety issues notwithstanding).
--
Peter Geoghegan
On Wed, 26 Jul 2023 at 15:42, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent
I'm not sure I understand. MDAM seems to work on an index level to
return full ranges of values, while "skip scan" seems to try to allow
systems to signal to the index to skip to some other index condition
based on arbitrary cutoffs. This would usually be those of which the
information is not stored in the index, such as "SELECT user_id FROM
orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go
though the user_id index and skip to the next user_id value when it
gets enough rows of a matching result (where "enough" is determined
above the index AM's plan node, or otherwise is impossible to
determine with only the scan key info in the index AM). I'm not sure
how this could work without specifically adding skip scan-related
index AM functionality, and I don't see how it fits in with this
MDAM/SAOP system.I think of that as being quite a different thing.
Basically, the patch that added that feature had to revise the index
AM API, in order to support a mode of operation where scans return
groupings rather than tuples. Whereas this patch requires none of
that. It makes affected index scans as similar as possible to
conventional index scans.
Hmm, yes. I see now where my confusion started. You called it out in
your first paragraph of the original mail, too, but that didn't help
me then:
The wiki does not distinguish "Index Skip Scans" and "Loose Index
Scans", but these are not the same.
In the one page on "Loose indexscan", it refers to MySQL's "loose
index scan" documentation, which does handle groupings, and this was
targeted with the previous, mislabeled, "Index skipscan" patchset.
However, crucially, it also refers to other databases' Index Skip Scan
documentation, which document and implement this approach of 'skipping
to the next potential key range to get efficient non-prefix qual
results', giving me a false impression that those two features are one
and the same when they are not.
It seems like I'll have to wait a bit longer for the functionality of
Loose Index Scans.
[...]
Thoughts?
MDAM seems to require exponential storage for "scan key operations"
for conditions on N columns (to be precise, the product of the number
of distinct conditions on each column); e.g. an index on mytable
(a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
AND h IN (1, 2)" would require 2^8 entries.Note that I haven't actually changed anything about the way that the
state machine generates new sets of single value predicates -- it's
still just cycling through each distinct set of array keys in the
patch.What you describe is a problem in theory, but I doubt that it's a
problem in practice. You don't actually have to materialize the
predicates up-front, or at all.
Yes, that's why I asked: The MDAM paper's examples seem to materialize
the full predicate up-front, which would require a product of all
indexed columns' quals in size, so that materialization has a good
chance to get really, really large. But if we're not doing that
materialization upfront, then there is no issue with resource
consumption (except CPU time, which can likely be improved with other
methods)
Kind regards,
Matthias van de Meent
Neon (https://neon.tech/)
On Thu, 27 Jul 2023 at 06:14, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Jul 26, 2023 at 12:07 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:We could cache the last accessed leaf page across amrescan operations
to reduce the number of index traversals needed when the join key of
the left side is highly (but not necessarily strictly) correllated.That sounds like block nested loop join. It's possible that that could
reuse some infrastructure from this patch, but I'm not sure.
My idea is not quite block nested loop join. It's more 'restart the
index scan at the location the previous index scan ended, if
heuristics say there's a good chance that might save us time'. I'd say
it is comparable to the fast tree descent optimization that we have
for endpoint queries, and comparable to this patch's scankey
optimization, but across AM-level rescans instead of internal rescans.
See also the attached prototype and loosely coded patch. It passes
tests, but it might not be without bugs.
The basic design of that patch is this: We keep track of how many
times we've rescanned, and the end location of the index scan. If a
new index scan hits the same page after _bt_search as the previous
scan ended, we register that. Those two values - num_rescans and
num_samepage - are used as heuristics for the following:
If 50% or more of rescans hit the same page as the end location of the
previous scan, we start saving the scan's end location's buffer into
the BTScanOpaque, so that the next _bt_first can check whether that
page might be the right leaf page, and if so, immediately go to that
buffer instead of descending the tree - saving one tree descent in the
process.
Further optimizations of this mechanism could easily be implemented by
e.g. only copying the min/max index tuples instead of the full index
page, reducing the overhead at scan end.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Attachments:
v1-0001-Cache-btree-scan-end-page-across-rescans-in-the-s.patch.cfbot-ignoreapplication/octet-stream; name=v1-0001-Cache-btree-scan-end-page-across-rescans-in-the-s.patch.cfbot-ignoreDownload
From e5232b8a8e90f60f45aadc813c5f024c9ecd8dab Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 27 Jul 2023 15:36:01 +0200
Subject: [PATCH v1] Cache btree scan end page across rescans in the same node
If the index is repeatedly scanned for values (e.g. in a nested loop) and if
the values that are being looked up are highly correlated, then we can likely
reuse the previous index scan's last page as a startpoint for the new scan,
instead of going through a relatively expensive index descent.
---
src/backend/access/nbtree/nbtpage.c | 18 +++++
src/backend/access/nbtree/nbtree.c | 8 +++
src/backend/access/nbtree/nbtsearch.c | 100 ++++++++++++++++++++++++--
src/include/access/nbtree.h | 21 ++++++
4 files changed, 141 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index d78971bfe8..897b6772fc 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -856,6 +856,24 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
return buf;
}
+Buffer
+_bt_getrecentbuf(Relation rel, BlockNumber blkno, Buffer buf, int access)
+{
+ Assert(BlockNumberIsValid(blkno));
+ Assert(BufferIsValid(buf));
+
+ if (!ReadRecentBuffer(rel->rd_locator, MAIN_FORKNUM, blkno, buf))
+ {
+ /* Read an existing block of the relation */
+ buf = ReadBuffer(rel, blkno);
+ }
+
+ _bt_lockbuf(rel, buf, access);
+ _bt_checkpage(rel, buf);
+
+ return buf;
+}
+
/*
* _bt_allocbuf() -- Allocate a new block/page.
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4553aaee53..e4ca2f8ecb 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -376,6 +376,11 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->rescans = so->rescanSamePage = 0;
+ so->recentEndPage = InvalidBlockNumber;
+ so->pageCacheValid = false;
+ so->pageCache = (char *) palloc(BLCKSZ);
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -402,6 +407,7 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
BTScanPosInvalidate(so->currPos);
}
+ so->rescans++;
so->markItemIndex = -1;
so->arrayKeyCount = 0;
BTScanPosUnpinIfPinned(so->markPos);
@@ -467,6 +473,8 @@ btendscan(IndexScanDesc scan)
/* Release storage */
if (so->keyData != NULL)
pfree(so->keyData);
+ if (so->pageCache != NULL)
+ pfree(so->pageCache);
/* so->arrayKeyData and so->arrayKeys are in arrayContext */
if (so->arrayContext != NULL)
MemoryContextDelete(so->arrayContext);
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 3230b3b894..67e597b56c 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -47,6 +47,45 @@ static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+static void _bt_endscanonpage(BTScanOpaque btso, Buffer buf,
+ BlockNumber endPage, Page page);
+
+#define RescanMayHitSamePage(so) (((float) ((so)->rescanSamePage) / (float) ((so)->rescans)) >= 0.5)
+
+static void
+_bt_endscanonpage(BTScanOpaque btso, Buffer buf, BlockNumber endPage, Page page)
+{
+ BlockNumber prevEndPage = btso->recentEndPage;
+
+ /*
+ * We have often (>50%) hit the page the previous scan ended on, so
+ * cache the current (last) page of the scan for future use.
+ */
+ if (RescanMayHitSamePage(btso))
+ {
+ /*
+ * If we have a valid cache, and the cache contains this page, then
+ * we don't have anything to do.
+ */
+ if (prevEndPage == endPage && btso->pageCacheValid)
+ {
+ /* do nothing */
+ }
+ else
+ {
+ memcpy(btso->pageCache, page, BLCKSZ);
+ btso->pageCacheValid = true;
+ btso->recentBuffer = buf;
+ btso->recentEndPage = endPage;
+ }
+ }
+ else if (prevEndPage != endPage)
+ {
+ btso->recentEndPage = endPage;
+ btso->pageCacheValid = false;
+ }
+}
+
/*
* _bt_drop_lock_and_maybe_pin()
@@ -872,7 +911,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
{
Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Buffer buf;
+ Buffer buf = InvalidBuffer;
BTStack stack;
OffsetNumber offnum;
StrategyNumber strat;
@@ -1371,13 +1410,59 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
inskey.keysz = keysCount;
/*
- * Use the manufactured insertion scan key to descend the tree and
- * position ourselves on the target leaf page.
+ * If we've restarted the scan through amrescan, it is quite possible
+ * that a previous scan ended on the same page we're trying to find.
+ * If this happened repeatedly, we cache the last page of the scan,
+ * so that we may be able to ignore the penalty of traversing the tree
+ * from the top.
+ *
+ * XXX: Maybe this might not be concurrency-safe? Haven't thought about it
+ * quite yet.
*/
- stack = _bt_search(rel, NULL, &inskey, &buf, BT_READ, scan->xs_snapshot);
+ if (RescanMayHitSamePage(so) && so->pageCacheValid)
+ {
+ Page page = so->pageCache;
+ BTPageOpaque opaque = BTPageGetOpaque(page);
+
+ /* cached page is a leaf page, and is not empty */
+ Assert(P_ISLEAF(opaque) && PageGetMaxOffsetNumber(page) != 0);
+
+ /*
+ * If the search key doesn't fit within the min/max of this page,
+ * we continue with normal index descent.
+ */
+ if (_bt_compare(rel, &inskey, page, P_HIKEY) >= 0)
+ goto nocache;
+ if (_bt_compare(rel, &inskey, page, PageGetMaxOffsetNumber(page)) <= 0)
+ goto nocache;
+
+ buf = _bt_getrecentbuf(rel, so->recentEndPage, so->recentBuffer,
+ BT_READ);
- /* don't need to keep the stack around... */
- _bt_freestack(stack);
+ /*
+ * It is possible the page has split in the meantime, so we may have
+ * to move right
+ */
+ buf = _bt_moveright(rel, NULL, &inskey, buf, false, NULL, BT_READ,
+ scan->xs_snapshot);
+ so->rescanSamePage += 1;
+ }
+
+nocache:
+ if (!BufferIsValid(buf))
+ {
+ /*
+ * Use the manufactured insertion scan key to descend the tree and
+ * position ourselves on the target leaf page.
+ */
+ stack = _bt_search(rel, NULL, &inskey, &buf, BT_READ, scan->xs_snapshot);
+
+ /* don't need to keep the stack around... */
+ _bt_freestack(stack);
+
+ if (buf == so->recentBuffer)
+ so->rescanSamePage += 1;
+ }
if (!BufferIsValid(buf))
{
@@ -1792,6 +1877,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
}
+ if (!continuescan)
+ _bt_endscanonpage(so, so->currPos.buf, so->currPos.currPage, page);
+
return (so->currPos.firstItem <= so->currPos.lastItem);
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 8891fa7973..b76d433725 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1062,6 +1062,26 @@ typedef struct BTScanOpaqueData
char *currTuples; /* tuple storage for currPos */
char *markTuples; /* tuple storage for markPos */
+ /*
+ * If we're on the outer side of a loop join, parameterized joins may
+ * have some correlation, i.e. repeated close values which fit on
+ * nearby pages. By tracking at which leaf page we start and end our
+ * scans, we can detect this case and (in some cases) reduce buffer
+ * accesses by a huge margin.
+ *
+ * TODO: Caching this page locally is OK, because this is inside a query
+ * context and thus bound to the lifetime and snapshot of the query.
+ * Accessing an updated version of the page might not be, so that needs
+ * checking.
+ */
+ int rescans; /* number of rescans */
+ int rescanSamePage; /* number of rescans that started on the previous scan's end location */
+
+ BlockNumber recentEndPage; /* scan end page of previous scan */
+ Buffer recentBuffer; /* buffer of recentEndPage. May not match. */
+ bool pageCacheValid; /* is the cached page valid? */
+ char *pageCache; /* copy of the previous scan's last page, if valid */
+
/*
* If the marked position is on the same page as current position, we
* don't use markPos, but just keep the marked itemIndex in markItemIndex
@@ -1207,6 +1227,7 @@ extern void _bt_metaversion(Relation rel, bool *heapkeyspace,
bool *allequalimage);
extern void _bt_checkpage(Relation rel, Buffer buf);
extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
+extern Buffer _bt_getrecentbuf(Relation rel, BlockNumber blkno, Buffer buffer, int access);
extern Buffer _bt_allocbuf(Relation rel, Relation heaprel);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
BlockNumber blkno, int access);
--
2.40.1
On Thu, Jul 27, 2023 at 7:59 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Basically, the patch that added that feature had to revise the index
AM API, in order to support a mode of operation where scans return
groupings rather than tuples. Whereas this patch requires none of
that. It makes affected index scans as similar as possible to
conventional index scans.Hmm, yes. I see now where my confusion started. You called it out in
your first paragraph of the original mail, too, but that didn't help
me then:The wiki does not distinguish "Index Skip Scans" and "Loose Index
Scans", but these are not the same.
A lot of people (myself included) were confused on this point for
quite a while. To make matters even more confusing, one of the really
compelling cases for the MDAM design is scans that feed into
GroupAggregates -- preserving index sort order for naturally big index
scans will tend to enable it. One of my examples from the start of
this thread showed just that. (It just so happened that that example
was faster because of all the "skipping" that nbtree *wasn't* doing
with the patch.)
Yes, that's why I asked: The MDAM paper's examples seem to materialize
the full predicate up-front, which would require a product of all
indexed columns' quals in size, so that materialization has a good
chance to get really, really large. But if we're not doing that
materialization upfront, then there is no issue with resource
consumption (except CPU time, which can likely be improved with other
methods)
I get why you asked. I might have asked the same question.
As I said, the MDAM paper has *surprisingly* little to say about
B-Tree executor stuff -- it's almost all just describing the
preprocessing/transformation process. It seems as if optimizations
like the one from my patch were considered too obvious to talk about
and/or out of scope by the authors. Thinking about the MDAM paper like
that was what made everything fall into place for me. Remember,
"missing key predicates" isn't all that special.
--
Peter Geoghegan
On Thu, 27 Jul 2023 at 16:01, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Jul 27, 2023 at 7:59 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:Basically, the patch that added that feature had to revise the index
AM API, in order to support a mode of operation where scans return
groupings rather than tuples. Whereas this patch requires none of
that. It makes affected index scans as similar as possible to
conventional index scans.Hmm, yes. I see now where my confusion started. You called it out in
your first paragraph of the original mail, too, but that didn't help
me then:The wiki does not distinguish "Index Skip Scans" and "Loose Index
Scans", but these are not the same.A lot of people (myself included) were confused on this point for
quite a while.
I've taken the liberty to update the "Loose indexscan" wiki page [0]https://wiki.postgresql.org/wiki/Loose_indexscan,
adding detail that Loose indexscans are distinct from Skip scans, and
showing some high-level distinguishing properties.
I also split the TODO entry for `` "loose" or "skip" scan `` into two,
and added links to the relevant recent threads so that it's clear
these are different (and that some previous efforts may have had a
confusing name).
I hope this will reduce the chance of future confusion between the two
different approaches to improving index scan performance.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Hi, all!
CNF -> DNF conversion
=====================Like many great papers, the MDAM paper takes one core idea, and finds
ways to leverage it to the hilt. Here the core idea is to take
predicates in conjunctive normal form (an "AND of ORs"), and convert
them into disjunctive normal form (an "OR of ANDs"). DNF quals are
logically equivalent to CNF quals, but ideally suited to SAOP-array
style processing by an ordered B-Tree index scan -- they reduce
everything to a series of non-overlapping primitive index scans, that
can be processed in keyspace order. We already do this today in the
case of SAOPs, in effect. The nbtree "next array keys" state machine
already materializes values that can be seen as MDAM style DNF single
value predicates. The state machine works by outputting the cartesian
product of each array as a multi-column index is scanned, but that
could be taken a lot further in the future. We can use essentially the
same kind of state machine to do everything described in the paper --
ultimately, it just needs to output a list of disjuncts, like the DNF
clauses that the paper shows in "Table 3".In theory, anything can be supported via a sufficiently complete CNF
-> DNF conversion framework. There will likely always be the potential
for unsafe/unsupported clauses and/or types in an extensible system
like Postgres, though. So we will probably need to retain some notion
of safety. It seems like more of a job for nbtree preprocessing (or
some suitably index-AM-agnostic version of the same idea) than the
optimizer, in any case. But that's not entirely true, either (that
would be far too easy).The optimizer still needs to optimize. It can't very well do that
without having some kind of advanced notice of what is and is not
supported by the index AM. And, the index AM cannot just unilaterally
decide that index quals actually should be treated as filter/qpquals,
after all -- it doesn't get a veto. So there is a mutual dependency
that needs to be resolved. I suspect that there needs to be a two way
conversation between the optimizer and nbtree code to break the
dependency -- a callback that does some of the preprocessing work
during planning. Tom said something along the same lines in passing,
when discussing the MDAM paper last year [2]. Much work remains here.
Honestly, I'm just reading and delving into this thread and other topics
related to it, so excuse me if I ask you a few obvious questions.
I noticed that you are going to add CNF->DNF transformation at the index
construction stage. If I understand correctly, you will rewrite
restrictinfo node,
change boolean "AND" expressions to "OR" expressions, but would it be
possible to apply such a procedure earlier? Otherwise I suppose you
could face the problem of
incorrect selectivity of the calculation and, consequently, the
cardinality calculation?
I can't clearly understand at what stage it is clear that the such a
transformation needs to be applied?
--
Regards,
Alena Rybakina
Postgres Professional
On Thu, Jul 27, 2023 at 10:00 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
My idea is not quite block nested loop join. It's more 'restart the
index scan at the location the previous index scan ended, if
heuristics say there's a good chance that might save us time'. I'd say
it is comparable to the fast tree descent optimization that we have
for endpoint queries, and comparable to this patch's scankey
optimization, but across AM-level rescans instead of internal rescans.
Yeah, I see what you mean. Seems related, even though what you've
shown in your prototype patch doesn't seem like it fits into my
taxonomy very neatly.
(BTW, I was a little confused by the use of the term "endpoint" at
first, since there is a function that uses that term to refer to a
descent of the tree that happens without any insertion scan key. This
path is used whenever the best we can do in _bt_first is to descend to
the rightmost or leftmost page.)
The basic design of that patch is this: We keep track of how many
times we've rescanned, and the end location of the index scan. If a
new index scan hits the same page after _bt_search as the previous
scan ended, we register that.
I can see one advantage that block nested loop join would retain here:
it does block-based accesses on both sides of the join. Since it
"looks ahead" on both sides of the join, more repeat accesses are
likely to be avoided.
Not too sure how much that matters in practice, though.
--
Peter Geoghegan
On Mon, Jul 31, 2023 at 12:24 PM Alena Rybakina
<lena.ribackina@yandex.ru> wrote:
I noticed that you are going to add CNF->DNF transformation at the index
construction stage. If I understand correctly, you will rewrite
restrictinfo node,
change boolean "AND" expressions to "OR" expressions, but would it be
possible to apply such a procedure earlier?
Sort of. I haven't really added any new CNF->DNF transformations. The
code you're talking about is really just checking that every index
path has clauses that we know that nbtree can handle. That's a big,
ugly modularity violation -- many of these details are quite specific
to the nbtree index AM (in theory we could have other index AMs that
are amsearcharray).
At most, v1 of the patch makes greater use of an existing
transformation that takes place in the nbtree index AM, as it
preprocesses scan keys for these types of queries (it's not inventing
new transformations at all). This is a slightly creative
interpretation, too. Tom's commit 9e8da0f7 didn't actually say
anything about CNF/DNF.
Otherwise I suppose you
could face the problem of
incorrect selectivity of the calculation and, consequently, the
cardinality calculation?
I can't think of any reason why that should happen as a direct result
of what I have done here. Multi-column index paths + multiple SAOP
clauses are not a new thing. The number of rows returned does not
depend on whether we have some columns as filter quals or not.
Of course that doesn't mean that the costing has no problems. The
costing definitely has several problems right now.
It also isn't necessarily okay that it's "just as good as before" if
it turns out that it needs to be better now. But I don't see why it
would be. (Actually, my hope is that selectivity estimation might be
*less* important as a practical matter with the patch.)
I can't clearly understand at what stage it is clear that the such a
transformation needs to be applied?
I don't know either.
I think that most of this work needs to take place in the nbtree code,
during preprocessing. But it's not so simple. There is a mutual
dependency between the code that generates index paths in the planner
and nbtree scan key preprocessing. The planner needs to know what
kinds of index paths are possible/safe up-front, so that it can choose
the fastest plan (the fastest that the index AM knows how to execute
correctly). But, there are lots of small annoying nbtree
implementation details that might matter, and can change.
I think we need to have nbtree register a callback, so that the
planner can initialize some preprocessing early. I think that we
require a "two way conversation" between the planner and the index AM.
--
Peter Geoghegan
On Wed, Jul 26, 2023 at 6:41 AM Peter Geoghegan <pg@bowt.ie> wrote:
MDAM seems to require exponential storage for "scan key operations"
for conditions on N columns (to be precise, the product of the number
of distinct conditions on each column); e.g. an index on mytable
(a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
AND h IN (1, 2)" would require 2^8 entries.
What you describe is a problem in theory, but I doubt that it's a
problem in practice. You don't actually have to materialize the
predicates up-front, or at all. Plus you can skip over them using the
next index tuple. So skipping works both ways.
Attached is v2, which makes all array key advancement take place using
the "next index tuple" approach (using binary searches to find array
keys using index tuple values). This approach was necessary for fairly
mundane reasons (it limits the amount of work required while holding a
buffer lock), but it also solves quite a few other problems that I
find far more interesting.
It's easy to imagine the state machine from v2 of the patch being
extended for skip scan. My approach "abstracts away" the arrays. For
skip scan, it would more or less behave as if the user had written a
query "WHERE a in (<Every possible value for this column>) AND b = 5
... " -- without actually knowing what the so-called array keys for
the high-order skipped column are (not up front, at least). We'd only
need to track the current "array key" for the scan key on the skipped
column, "a". The state machine would notice when the scan had reached
the next-greatest "a" value in the index (whatever that might be), and
then make that the current value. Finally, the state machine would
effectively instruct its caller to consider repositioning the scan via
a new descent of the index. In other words, almost everything for skip
scan would work just like regular SAOPs -- and any differences would
be well encapsulated.
But it's not just skip scan. This approach also enables thinking of
SAOP index scans (using nbtree) as just another type of indexable
clause, without any special restrictions (compared to true indexable
operators such as "=", say). Particularly in the planner. That was
always the general thrust of teaching nbtree about SAOPs, from the
start. But it's something that should be totally embraced IMV. That's
just what the patch proposes to do.
In particular, the patch now:
1. Entirely removes the long-standing restriction on generating path
keys for index paths with SAOPs, even when there are inequalities on a
high order column present. You can mix SAOPs together with other
clause types, arbitrarily, and everything still works and works
efficiently.
For example, the regression test expected output for this query/test
(from bugfix commit 807a40c5) is updated by the patch, as shown here:
explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
- QUERY PLAN
---------------------------------------------------------------------------------------
- Sort
- Sort Key: thousand
- -> Index Scan using tenk1_thous_tenthous on tenk1
- Index Cond: ((thousand < 2) AND (tenthous = ANY
('{1001,3000}'::integer[])))
-(4 rows)
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
We don't need a sort node anymore -- even though the leading column
here (thousand) uses an inequality, a particularly tricky case. Now
it's an index scan, much like any other, with no particular
restrictions caused by using a SAOP.
2. Adds an nbtree strategy for non-required equality array scan keys,
which is built on the same state machine, with only minor differences
to deal with column values "appearing out of key space order".
3. Simplifies the optimizer side of things by consistently avoiding
filter quals (except when it's truly unavoidable). The optimizer
doesn't even consider alternative index paths with filter quals for
lower-order SAOP columns, because they have no possible advantage
anymore. On the other hand, as we saw already, upthread, filter quals
have huge disadvantages. By always using true index quals, we
automatically avoid any question of getting excessive amounts of heap
page accesses just to eliminate non-matching rows. AFAICT we don't
need to make a trade-off here.
The first version of the patch added some crufty code to the
optimizer, to account for various restrictions on sort order. This
revised version actually removes existing cruft from the same place
(indxpath.c) instead.
Items 1, 2, and 3 are all closely related. Take the query I've shown
for item 1. Bugfix commit 807a40c5 (which added the test query in
question) dealt with an oversight in the then-recent original nbtree
SAOP patch (commit 9e8da0f7): when nbtree combines two primitive index
scans with an inequality on their leading column, we cannot be sure
that the output will appear in the same order as the order that one
big continuous index scan returns rows in. We can only expect to
maintain the illusion that we're doing one continuous index scan when
individual primitive index scans access earlier columns via the
equality strategy -- we need "equality constraints".
In practice, the optimizer (indxpath.c) is very conservative (more
conservative than it really needs to be) when it comes to trusting the
index scan to output rows in index order, in the presence of SAOPs.
All of that now seems totally unnecessary. Again, I don't see a need
to make a trade-off here.
My observation about this query (and others like it) is: why not
literally perform one continuous index scan instead (not multiple
primitive index scans)? That is strictly better, given all the
specifics here. Once we have a way to do that (which the nbtree
executor work listed under item 2 provides), it becomes safe to assume
that the tuples will be output in index order -- there is no illusion
left to preserve. Who needs an illusion that isn't actually helping
us? We actually do less I/O by using this strategy, for the usual
reasons (we can avoid repeating index page accesses).
A more concrete benefit of the non-required-scankeys stuff can be seen
by running Benoit Tigeot's test case [1]https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4690491#gistcomment-4690491 -- Peter Geoghegan with v2. He had a query like
this:
SELECT * FROM docs
WHERE status IN ('draft', 'sent') AND
sender_reference IN ('Custom/1175', 'Client/362', 'Custom/280')
ORDER BY sent_at DESC NULLS LAST LIMIT 20;
And, his test case had an index on "sent_at DESC NULLS LAST,
sender_reference, status". This variant was a weak spot for v1.
v2 of the patch is vastly more efficient here, since we don't have to
go to the heap to eliminate non-matching tuples -- that can happen in
the index AM instead. This can easily be 2x-3x faster on a warm cache,
and have *hundreds* of times fewer buffer accesses (which Benoit
verified with an early version of this v2). All because we now require
vastly less heap access -- the quals are fairly selective here, and we
have to scan hundreds of leaf pages before the scan can terminate.
Avoiding filter quals is a huge win.
This particular improvement is hard to squarely attribute to any one
of my 3 items. The immediate problem that the query presents us with
on the master branch is the problem of filter quals that require heap
accesses to do visibility checks (a problem that index quals can never
have). That makes it tempting to credit my item 3. But you can't
really have item 3 without also having items 1 and 2. Taken together,
they eliminate all possible downsides from using index quals.
That high level direction (try to have one good choice for the
optimizer) seems important to me. Both for this project, and in
general.
Other changes in v2:
* Improved costing, that takes advantage of the fact that nbtree now
promises to not repeat any leaf page accesses (unless the scan is
restarted or the direction of the scan changes). This makes the worst
case far more predictable, and more related to selectivity estimation
-- you can't scan more pages than you have in the whole index. Just
like with every other sort of index scan.
* Support for parallel index scans.
The existing approach to array keys for parallel index scan has been
adopted to work with individual primitive index scans, not individual
array keys. I haven't tested this very thoroughly just yet, but it
seems to work well enough already. I think that it's important to not
have very much variation between parallel and serial index scans,
which I seem to have mostly avoided.
[1]: https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4690491#gistcomment-4690491 -- Peter Geoghegan
--
Peter Geoghegan
Attachments:
v2-0001-Enhance-nbtree-ScalarArrayOp-execution.patchapplication/octet-stream; name=v2-0001-Enhance-nbtree-ScalarArrayOp-execution.patchDownload
From 7d8041cbf41736981431a0d063e5ecdc592402ee Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 17 Jun 2023 17:03:36 -0700
Subject: [PATCH v2] Enhance nbtree ScalarArrayOp execution.
Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively. This works by pushing additional context about the arrays
down into the nbtree index AM, as index quals. This information enabled
nbtree to execute multiple primitive index scans as part of an index
scan executor node that was treated as one continuous index scan.
The motivation behind this earlier work was enabling index-only scans
with ScalarArrayOpExpr clauses (SAOP quals are traditionally executed
via BitmapOr nodes, which is largely index-AM-agnostic, but always
requires heap access). The general idea of giving the index AM this
additional context can be pushed a lot further, though.
Teach nbtree SAOP index scans to dynamically advance array scan keys
using information about the characteristics of the index, determined at
runtime. The array key state machine advances the current array keys
using the next index tuple in line to be scanned, at the point where the
scan reaches the end of the last set of array keys. This approach is
far more flexible, and can be far more efficient. Cases that previously
required hundreds (even thousands) of primitive index scans now require
as few as one single primitive index scan.
Also remove all restrictions on generating path keys for nbtree index
scans that happen to have ScalarArrayOpExpr quals. Bugfix commit
807a40c5 taught the planner to avoid generating unsafe path keys: path
keys on a multicolumn index path, with a SAOP clause on any attribute
beyond the first/most significant attribute. These cases are now safe.
Now nbtree index scans with an inequality clause on a high order column
and a SAOP clause on a lower order column are executed as one single
primitive index scan, since that is the most efficient way to do it.
Non-required equality type SAOP quals are executed by nbtree using
almost the same approach used for required equality type SAOP quals.
nbtree is now strictly guaranteed to avoid all repeat accesses to any
individual leaf page, even in cases with inequalities on high order
columns (except when the scan direction changes, or the scan restarts).
We now have strong guarantees about the worst case, which is very useful
when costing index scans with SAOP clauses. The cost profile of index
paths with multiple SAOP clauses is now a lot closer to other cases;
more selective index scans will now generally have lower costs than less
selective index scans. The added cost from repeatedly descending the
index still matters, but it can never dominate.
An important goal of this work is to remove all ScalarArrayOpExpr clause
special cases from the planner -- ScalarArrayOpExpr clauses can now be
thought of a generalization of simple equality clauses (except when
costing index scans, perhaps). The planner no longer needs to generate
alternative index paths with filter quals/qpquals. We assume that true
SAOP index quals are strictly better than filter/qpquals, since the work
in nbtree guarantees that they'll be at least slightly faster.
Many of the queries sped up by the work from this commit don't directly
benefit from the nbtree/executor enhancements. They benefit indirectly.
The planner no longer shows any restraint around making SAOP clauses
into true nbtree index quals, which tends to result in significant
savings on heap page accesses. In general we never need visibility
checks to evaluate true index quals, whereas filter quals often need to
perform extra heap accesses, just to eliminate non-matching tuples
(expression evaluation is only safe with known visible tuples).
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
---
src/include/access/nbtree.h | 37 +-
src/backend/access/nbtree/nbtree.c | 58 +-
src/backend/access/nbtree/nbtsearch.c | 72 +-
src/backend/access/nbtree/nbtutils.c | 1312 ++++++++++++++++++--
src/backend/optimizer/path/indxpath.c | 64 +-
src/backend/utils/adt/selfuncs.c | 123 +-
src/test/regress/expected/create_index.out | 61 +-
src/test/regress/expected/join.out | 5 +-
src/test/regress/sql/create_index.sql | 20 +-
9 files changed, 1462 insertions(+), 290 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f5c66964c..6ab5be544 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1045,9 +1045,11 @@ typedef struct BTScanOpaqueData
ScanKey arrayKeyData; /* modified copy of scan->keyData */
int numArrayKeys; /* number of equality-type array keys (-1 if
* there are any unsatisfiable array keys) */
- int arrayKeyCount; /* count indicating number of array scan keys
- * processed */
+ int numPrimScans; /* count indicating number of primitive index
+ * scans for array scan keys */
+ bool needPrimScan; /* Perform another primitive scan? */
BTArrayKeyInfo *arrayKeys; /* info about each equality-type array key */
+ FmgrInfo *orderProcs; /* ORDER procs for equality constraint keys */
MemoryContext arrayContext; /* scan-lifespan context for array data */
/* info about killed items if any (killedItems is NULL if never used) */
@@ -1078,6 +1080,29 @@ typedef struct BTScanOpaqueData
typedef BTScanOpaqueData *BTScanOpaque;
+/*
+ * _bt_readpage state used across _bt_checkkeys calls for a page
+ *
+ * When _bt_readpage is called during a forward scan that has one or more
+ * equality-type SK_SEARCHARRAY scan keys, it has an extra responsibility: to
+ * set up information about the page high key. This must happen before the
+ * first call to _bt_checkkeys. _bt_checkkeys uses this information to manage
+ * advancement of the scan's array keys.
+ */
+typedef struct BTReadPageState
+{
+ /* Input parameters, set by _bt_readpage */
+ ScanDirection dir; /* current scan direction */
+ IndexTuple highkey; /* page high key, set by forward scans */
+
+ /* Output parameters, set by _bt_checkkeys */
+ bool continuescan; /* Terminate ongoing (primitive) index scan? */
+
+ /* Private _bt_checkkeys-managed state */
+ bool highkeychecked; /* high key checked against current
+ * SK_SEARCHARRAY array keys? */
+} BTReadPageState;
+
/*
* We use some private sk_flags bits in preprocessed scan keys. We're allowed
* to use bits 16-31 (see skey.h). The uppermost bits are copied from the
@@ -1155,7 +1180,7 @@ extern bool btcanreturn(Relation index, int attno);
extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno);
extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
extern void _bt_parallel_done(IndexScanDesc scan);
-extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
+extern void _bt_parallel_next_primitive_scan(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
@@ -1248,12 +1273,12 @@ extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
+extern bool _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool finaltup);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 62bc9917f..5c1840436 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -48,8 +48,9 @@
* BTPARALLEL_IDLE indicates that no backend is currently advancing the scan
* to a new page; some process can start doing that.
*
- * BTPARALLEL_DONE indicates that the scan is complete (including error exit).
- * We reach this state once for every distinct combination of array keys.
+ * BTPARALLEL_DONE indicates that the primitive index scan is complete
+ * (including error exit). We reach this state once for every distinct
+ * primitive index scan.
*/
typedef enum
{
@@ -69,8 +70,8 @@ typedef struct BTParallelScanDescData
BTPS_State btps_pageStatus; /* indicates whether next page is
* available for scan. see above for
* possible states of parallel scan. */
- int btps_arrayKeyCount; /* count indicating number of array scan
- * keys processed by parallel scan */
+ int btps_numPrimScans; /* count indicating number of primitive
+ * index scans for array scan keys */
slock_t btps_mutex; /* protects above variables */
ConditionVariable btps_cv; /* used to synchronize parallel scan */
} BTParallelScanDescData;
@@ -276,7 +277,7 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
if (res)
break;
/* ... otherwise see if we have more array keys to deal with */
- } while (so->numArrayKeys && _bt_advance_array_keys(scan, dir));
+ } while (so->numArrayKeys && _bt_array_keys_remain(scan, dir));
return res;
}
@@ -334,7 +335,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
}
}
/* Now see if we have more array keys to deal with */
- } while (so->numArrayKeys && _bt_advance_array_keys(scan, ForwardScanDirection));
+ } while (so->numArrayKeys && _bt_array_keys_remain(scan, ForwardScanDirection));
return ntids;
}
@@ -365,7 +366,9 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->arrayKeyData = NULL; /* assume no array keys for now */
so->numArrayKeys = 0;
+ so->needPrimScan = false;
so->arrayKeys = NULL;
+ so->orderProcs = NULL;
so->arrayContext = NULL;
so->killedItems = NULL; /* until needed */
@@ -405,7 +408,8 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
}
so->markItemIndex = -1;
- so->arrayKeyCount = 0;
+ so->numPrimScans = 0;
+ so->needPrimScan = false;
BTScanPosUnpinIfPinned(so->markPos);
BTScanPosInvalidate(so->markPos);
@@ -586,7 +590,7 @@ btinitparallelscan(void *target)
SpinLockInit(&bt_target->btps_mutex);
bt_target->btps_scanPage = InvalidBlockNumber;
bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- bt_target->btps_arrayKeyCount = 0;
+ bt_target->btps_numPrimScans = 0;
ConditionVariableInit(&bt_target->btps_cv);
}
@@ -612,7 +616,7 @@ btparallelrescan(IndexScanDesc scan)
SpinLockAcquire(&btscan->btps_mutex);
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- btscan->btps_arrayKeyCount = 0;
+ btscan->btps_numPrimScans = 0;
SpinLockRelease(&btscan->btps_mutex);
}
@@ -623,7 +627,17 @@ btparallelrescan(IndexScanDesc scan)
*
* The return value is true if we successfully seized the scan and false
* if we did not. The latter case occurs if no pages remain for the current
- * set of scankeys.
+ * primitive index scan.
+ *
+ * When array scan keys are in use, each worker process independently advances
+ * its array keys. It's crucial that each worker process never be allowed to
+ * scan a page from before the current scan position.
+ *
+ * XXX This particular aspect of the patch is still at the proof of concept
+ * stage. Having this much available for review at least suggests that it'll
+ * be feasible to port the existing parallel scan array scan key stuff over to
+ * using a primitive index scan counter (as opposed to an array key counter)
+ * the top-level scan. I have yet to really put this code through its paces.
*
* If the return value is true, *pageno returns the next or current page
* of the scan (depending on the scan direction). An invalid block number
@@ -654,7 +668,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
SpinLockAcquire(&btscan->btps_mutex);
pageStatus = btscan->btps_pageStatus;
- if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+ if (so->numPrimScans < btscan->btps_numPrimScans)
{
/* Parallel scan has already advanced to a new set of scankeys. */
status = false;
@@ -695,9 +709,12 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
void
_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
{
+ BTScanOpaque so PG_USED_FOR_ASSERTS_ONLY = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
+ Assert(!so->needPrimScan);
+
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
@@ -731,12 +748,11 @@ _bt_parallel_done(IndexScanDesc scan)
parallel_scan->ps_offset);
/*
- * Mark the parallel scan as done for this combination of scan keys,
- * unless some other process already did so. See also
- * _bt_advance_array_keys.
+ * Mark the primitive index scan as done, unless some other process
+ * already did so. See also _bt_array_keys_remain.
*/
SpinLockAcquire(&btscan->btps_mutex);
- if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
+ if (so->numPrimScans >= btscan->btps_numPrimScans &&
btscan->btps_pageStatus != BTPARALLEL_DONE)
{
btscan->btps_pageStatus = BTPARALLEL_DONE;
@@ -750,14 +766,14 @@ _bt_parallel_done(IndexScanDesc scan)
}
/*
- * _bt_parallel_advance_array_keys() -- Advances the parallel scan for array
- * keys.
+ * _bt_parallel_next_primitive_scan() -- Advances parallel primitive scan
+ * counter when array keys are in use.
*
- * Updates the count of array keys processed for both local and parallel
+ * Updates the count of primitive index scans for both local and parallel
* scans.
*/
void
-_bt_parallel_advance_array_keys(IndexScanDesc scan)
+_bt_parallel_next_primitive_scan(IndexScanDesc scan)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
@@ -766,13 +782,13 @@ _bt_parallel_advance_array_keys(IndexScanDesc scan)
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
- so->arrayKeyCount++;
+ so->numPrimScans++;
SpinLockAcquire(&btscan->btps_mutex);
if (btscan->btps_pageStatus == BTPARALLEL_DONE)
{
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- btscan->btps_arrayKeyCount++;
+ btscan->btps_numPrimScans++;
}
SpinLockRelease(&btscan->btps_mutex);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 17ad89749..d51bc458b 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -879,6 +879,18 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Assert(!BTScanPosIsValid(so->currPos));
+ /*
+ * XXX Queries with SAOPs have always accounted for each call here as one
+ * "index scan". This meant that the accounting showed one index scan per
+ * distinct SAOP constant. This approach is consistent with how it was
+ * done before nbtree was taught to handle ScalarArrayOpExpr quals itself
+ * (it's also how non-amsearcharray index AMs still do it).
+ *
+ * Right now, eliding a primitive index scan elides a call here, resulting
+ * in one less "index scan" recorded by pgstat. This seems defensible,
+ * though not necessarily desirable. Now implementation details can have
+ * a significant impact on user-visible index scan counts.
+ */
pgstat_count_index_scan(rel);
/*
@@ -952,6 +964,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* one we use --- by definition, they are either redundant or
* contradictory.
*
+ * When SK_SEARCHARRAY keys are in use, _bt_tuple_before_array_keys is
+ * used to avoid prematurely stopping the scan when an array equality qual
+ * has its array keys advanced.
+ *
* Any regular (not SK_SEARCHNULL) key implies a NOT NULL qualifier.
* If the index stores nulls at the end of the index we'll be starting
* from, and we have no boundary key for the column (which means the key
@@ -1536,9 +1552,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
BTPageOpaque opaque;
OffsetNumber minoff;
OffsetNumber maxoff;
+ BTReadPageState pstate;
int itemIndex;
- bool continuescan;
- int indnatts;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1558,8 +1573,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
}
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ pstate.dir = dir;
+ pstate.highkey = NULL;
+ pstate.continuescan = true; /* default assumption */
+ pstate.highkeychecked = false;
+
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
@@ -1594,6 +1612,14 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (ScanDirectionIsForward(dir))
{
+ /* SK_SEARCHARRAY scans must provide high key up front */
+ if (so->numArrayKeys && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+
+ pstate.highkey = (IndexTuple) PageGetItem(page, iid);
+ }
+
/* load items[] in ascending order */
itemIndex = 0;
@@ -1616,7 +1642,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, iid);
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+ if (_bt_checkkeys(scan, &pstate, itup, false))
{
/* tuple passes all scan key conditions */
if (!BTreeTupleIsPosting(itup))
@@ -1649,7 +1675,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
}
}
/* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
+ if (!pstate.continuescan)
break;
offnum = OffsetNumberNext(offnum);
@@ -1666,17 +1692,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* only appear on non-pivot tuples on the right sibling page are
* common.
*/
- if (continuescan && !P_RIGHTMOST(opaque))
+ if (pstate.continuescan && !P_RIGHTMOST(opaque))
{
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
+ IndexTuple itup;
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+ if (pstate.highkey)
+ itup = pstate.highkey;
+ else
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+ }
+
+ _bt_checkkeys(scan, &pstate, itup, true);
}
- if (!continuescan)
+ if (!pstate.continuescan)
so->currPos.moreRight = false;
Assert(itemIndex <= MaxTIDsPerBTreePage);
@@ -1697,6 +1729,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
IndexTuple itup;
bool tuple_alive;
bool passes_quals;
+ bool finaltup = (offnum == minoff);
/*
* If the scan specifies not to return killed tuples, then we
@@ -1707,12 +1740,18 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* tuple on the page, we do check the index keys, to prevent
* uselessly advancing to the page to the left. This is similar
* to the high key optimization used by forward scans.
+ *
+ * Separately, _bt_checkkeys actually requires that we call it
+ * with the final non-pivot tuple from the page, if there's one
+ * (final processed tuple, or first tuple in offset number terms).
+ * We must indicate which particular tuple comes last, too.
*/
if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
{
Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
+ if (!finaltup)
{
+ Assert(offnum > minoff);
offnum = OffsetNumberPrev(offnum);
continue;
}
@@ -1724,8 +1763,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, iid);
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
+ passes_quals = _bt_checkkeys(scan, &pstate, itup, finaltup);
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions */
@@ -1764,7 +1802,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
}
}
}
- if (!continuescan)
+ if (!pstate.continuescan)
{
/* there can't be any more matches, so stop */
so->currPos.moreLeft = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7da499c4d..c99518352 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -33,7 +33,7 @@
typedef struct BTSortArrayContext
{
- FmgrInfo flinfo;
+ FmgrInfo *orderproc;
Oid collation;
bool reverse;
} BTSortArrayContext;
@@ -41,15 +41,33 @@ typedef struct BTSortArrayContext
static Datum _bt_find_extreme_element(IndexScanDesc scan, ScanKey skey,
StrategyNumber strat,
Datum *elems, int nelems);
+static void _bt_sort_cmp_func_setup(IndexScanDesc scan, ScanKey skey);
static int _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
bool reverse,
Datum *elems, int nelems);
static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
+static inline int32 _bt_compare_array_skey(ScanKey cur, FmgrInfo *orderproc,
+ Datum datum, bool null,
+ Datum arrdatum);
+static int _bt_binsrch_array_skey(ScanDirection dir, bool cur_elem_start,
+ BTArrayKeyInfo *array, ScanKey cur,
+ FmgrInfo *orderproc, Datum datum, bool null,
+ int32 *final_result);
+static bool _bt_tuple_before_array_skeys(IndexScanDesc scan,
+ BTReadPageState *pstate,
+ IndexTuple tuple);
+static bool _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool skrequiredtrigger);
+static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir);
+static void _bt_advance_array_keys_to_end(IndexScanDesc scan, ScanDirection dir);
static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
ScanKey leftarg, ScanKey rightarg,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
+static bool _bt_check_compare(ScanDirection dir, ScanKey keyData, int keysz,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ bool *continuescan, bool *skrequiredtrigger);
static bool _bt_check_rowcompare(ScanKey skey,
IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
ScanDirection dir, bool *continuescan);
@@ -202,6 +220,21 @@ _bt_freestack(BTStack stack)
* array keys, it's sufficient to find the extreme element value and replace
* the whole array with that scalar value.
*
+ * In the worst case, the number of primitive index scans will equal the
+ * number of array elements (or the product of the number of array keys when
+ * there are multiple arrays/columns involved). It's also possible that the
+ * total number of primitive index scans will be far less than that.
+ *
+ * We always sort and deduplicate arrays up-front for equality array keys.
+ * ScalarArrayOpExpr execution need only visit leaf pages that might contain
+ * matches exactly once, while preserving the sort order of the index. This
+ * isn't just about performance; it also avoids needing duplicate elimination
+ * of matching TIDs (we prefer deduplicating search keys once, up-front).
+ * Equality SK_SEARCHARRAY keys are disjuncts that we always process in
+ * index/key space order, which makes this general approach feasible. Every
+ * index tuple will match no more than one single distinct combination of
+ * equality-constrained keys (array keys and other equality keys).
+ *
* Note: the reason we need so->arrayKeyData, rather than just scribbling
* on scan->keyData, is that callers are permitted to call btrescan without
* supplying a new set of scankey data.
@@ -212,6 +245,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
int numberOfKeys = scan->numberOfKeys;
int16 *indoption = scan->indexRelation->rd_indoption;
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(scan->indexRelation);
int numArrayKeys;
ScanKey cur;
int i;
@@ -265,6 +299,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
/* Allocate space for per-array data in the workspace context */
so->arrayKeys = (BTArrayKeyInfo *) palloc0(numArrayKeys * sizeof(BTArrayKeyInfo));
+ so->orderProcs = (FmgrInfo *) palloc(nkeyatts * sizeof(FmgrInfo));
/* Now process each array key */
numArrayKeys = 0;
@@ -281,6 +316,17 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
int j;
cur = &so->arrayKeyData[i];
+
+ /*
+ * Attributes with equality-type scan keys (including but not limited
+ * to array scan keys) will need a 3-way comparison function.
+ *
+ * XXX Clean this up some more. This repeats some of the same work
+ * when there are multiple scan keys for the same key column.
+ */
+ if (cur->sk_strategy == BTEqualStrategyNumber)
+ _bt_sort_cmp_func_setup(scan, cur);
+
if (!(cur->sk_flags & SK_SEARCHARRAY))
continue;
@@ -436,6 +482,42 @@ _bt_find_extreme_element(IndexScanDesc scan, ScanKey skey,
return result;
}
+/*
+ * Look up the appropriate comparison function in the opfamily.
+ *
+ * Note: it's possible that this would fail, if the opfamily is incomplete,
+ * but it seems quite unlikely that an opfamily would omit non-cross-type
+ * support functions for any datatype that it supports at all.
+ */
+static void
+_bt_sort_cmp_func_setup(IndexScanDesc scan, ScanKey skey)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ Oid elemtype;
+ RegProcedure cmp_proc;
+ FmgrInfo *orderproc = &so->orderProcs[skey->sk_attno - 1];
+
+ /*
+ * Determine the nominal datatype of the array elements. We have to
+ * support the convention that sk_subtype == InvalidOid means the opclass
+ * input type; this is a hack to simplify life for ScanKeyInit().
+ */
+ elemtype = skey->sk_subtype;
+ if (elemtype == InvalidOid)
+ elemtype = rel->rd_opcintype[skey->sk_attno - 1];
+
+ cmp_proc = get_opfamily_proc(rel->rd_opfamily[skey->sk_attno - 1],
+ rel->rd_opcintype[skey->sk_attno - 1],
+ elemtype,
+ BTORDER_PROC);
+ if (!RegProcedureIsValid(cmp_proc))
+ elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
+ BTORDER_PROC, elemtype, elemtype,
+ rel->rd_opfamily[skey->sk_attno - 1]);
+ fmgr_info_cxt(cmp_proc, orderproc, so->arrayContext);
+}
+
/*
* _bt_sort_array_elements() -- sort and de-dup array elements
*
@@ -450,42 +532,14 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
bool reverse,
Datum *elems, int nelems)
{
- Relation rel = scan->indexRelation;
- Oid elemtype;
- RegProcedure cmp_proc;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
BTSortArrayContext cxt;
if (nelems <= 1)
return nelems; /* no work to do */
- /*
- * Determine the nominal datatype of the array elements. We have to
- * support the convention that sk_subtype == InvalidOid means the opclass
- * input type; this is a hack to simplify life for ScanKeyInit().
- */
- elemtype = skey->sk_subtype;
- if (elemtype == InvalidOid)
- elemtype = rel->rd_opcintype[skey->sk_attno - 1];
-
- /*
- * Look up the appropriate comparison function in the opfamily.
- *
- * Note: it's possible that this would fail, if the opfamily is
- * incomplete, but it seems quite unlikely that an opfamily would omit
- * non-cross-type support functions for any datatype that it supports at
- * all.
- */
- cmp_proc = get_opfamily_proc(rel->rd_opfamily[skey->sk_attno - 1],
- elemtype,
- elemtype,
- BTORDER_PROC);
- if (!RegProcedureIsValid(cmp_proc))
- elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
- BTORDER_PROC, elemtype, elemtype,
- rel->rd_opfamily[skey->sk_attno - 1]);
-
/* Sort the array elements */
- fmgr_info(cmp_proc, &cxt.flinfo);
+ cxt.orderproc = &so->orderProcs[skey->sk_attno - 1];
cxt.collation = skey->sk_collation;
cxt.reverse = reverse;
qsort_arg(elems, nelems, sizeof(Datum),
@@ -507,7 +561,7 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
BTSortArrayContext *cxt = (BTSortArrayContext *) arg;
int32 compare;
- compare = DatumGetInt32(FunctionCall2Coll(&cxt->flinfo,
+ compare = DatumGetInt32(FunctionCall2Coll(cxt->orderproc,
cxt->collation,
da, db));
if (cxt->reverse)
@@ -515,6 +569,171 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
return compare;
}
+/*
+ * Comparator uses to search for the next array element when array keys need
+ * to be advanced via one or more binary searches
+ *
+ * This code is loosely based on _bt_compare. However, there are some
+ * important differences.
+ *
+ * It is convenient to think of calling _bt_compare as comparing caller's
+ * insertion scankey to an index tuple. But our callers are not searching
+ * through the index at all -- they're searching through a local array of
+ * datums associated with a scan key (using values they've taken from an index
+ * tuple). This is a complete reversal of how things usually work, which can
+ * be confusing.
+ *
+ * Callers of this function should think of it as comparing "datum" (as well
+ * as "null") to "arrdatum". This is the same approach that _bt_compare takes
+ * in that both functions compare the value that they're searching for to one
+ * particular item used as a binary search pivot. (But it's the wrong way
+ * around if you think of it as "tuple values vs scan key values". So don't.)
+*/
+static inline int32
+_bt_compare_array_skey(ScanKey cur,
+ FmgrInfo *orderproc,
+ Datum datum,
+ bool null,
+ Datum arrdatum)
+{
+ int32 result = 0;
+
+ Assert(cur->sk_strategy == BTEqualStrategyNumber);
+
+ if (cur->sk_flags & SK_ISNULL) /* array/scan key is NULL */
+ {
+ if (null)
+ result = 0; /* NULL "=" NULL */
+ else if (cur->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NULL "<" NOT_NULL */
+ else
+ result = -1; /* NULL ">" NOT_NULL */
+ }
+ else if (null) /* array/scan key is NOT_NULL and tuple item
+ * is NULL */
+ {
+ if (cur->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NOT_NULL ">" NULL */
+ else
+ result = 1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * Like _bt_compare, we need to be careful of cross-type comparisons,
+ * so the left value has to be the value that came from an index
+ * tuple. (Array scan keys cannot be cross-type, but other required
+ * scan keys that use an equal operator can be.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(orderproc, cur->sk_collation,
+ datum, arrdatum));
+
+ /*
+ * Unlike _bt_compare, we flip the sign when column is a DESC column
+ * (and *not* when column is ASC). This matches the approach taken by
+ * _bt_check_rowcompare, which performs similar three-way comparisons.
+ */
+ if (cur->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ return result;
+}
+
+/*
+ * _bt_binsrch_array_skey() -- Binary search for next matching array key
+ *
+ * cur_elem_start indicates if the binary search should begin at the array's
+ * current element (or have the current element as an upper bound if it's a
+ * backward scan). This allows searches against required scan key arrays to
+ * reuse the work of earlier searches, at least in many important cases.
+ * Array keys covering key space that the index scan already processed cannot
+ * possibly contain any matches.
+ *
+ * XXX There are several fairly obvious optimizations that we could apply here
+ * (e.g., precheck searches for earlier subsets of a larger array would help).
+ * Revisit this during the next round of performance validation.
+ *
+ * Returns an index to the first array element >= caller's datum argument.
+ * Also sets *final_result to whatever _bt_compare_array_skey returned when we
+ * directly compared the returned array element to searched-for datum.
+ */
+static int
+_bt_binsrch_array_skey(ScanDirection dir, bool cur_elem_start,
+ BTArrayKeyInfo *array, ScanKey cur,
+ FmgrInfo *orderproc, Datum datum, bool null,
+ int32 *final_result)
+{
+ int low_elem,
+ high_elem,
+ first_elem_dir,
+ result = 0;
+ bool knownequal = false;
+
+ Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(cur->sk_strategy == BTEqualStrategyNumber);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ first_elem_dir = 0;
+ low_elem = array->cur_elem;
+ high_elem = array->num_elems - 1;
+ if (cur_elem_start)
+ low_elem = 0;
+ }
+ else
+ {
+ first_elem_dir = array->num_elems - 1;
+ low_elem = 0;
+ high_elem = array->cur_elem;
+ if (cur_elem_start)
+ {
+ low_elem = 0;
+ high_elem = first_elem_dir;
+ }
+ }
+
+ while (high_elem > low_elem)
+ {
+ int mid_elem = low_elem + ((high_elem - low_elem) / 2);
+ Datum arrdatum = array->elem_values[mid_elem];
+
+ result = _bt_compare_array_skey(cur, orderproc, datum, null, arrdatum);
+
+ if (result == 0)
+ {
+ /*
+ * Each array was deduplicated during initial preprocessing, so
+ * there each element is guaranteed to be unique. We can quit as
+ * soon as we see an equal array, saving ourselves an extra
+ * comparison or two...
+ */
+ low_elem = mid_elem;
+ knownequal = true;
+ break;
+ }
+
+ if (result > 0)
+ low_elem = mid_elem + 1;
+ else
+ high_elem = mid_elem;
+ }
+
+ /*
+ * ... but our caller also cares about the position of the searched-for
+ * datum relative to the low_elem match we'll return. Make sure that we
+ * set *final_result to the result that comes from comparing low_elem's
+ * key value to the datum that caller had us search for.
+ */
+ if (!knownequal)
+ result = _bt_compare_array_skey(cur, orderproc, datum, null,
+ array->elem_values[low_elem]);
+
+ *final_result = result;
+
+ return low_elem;
+}
+
/*
* _bt_start_array_keys() -- Initialize array keys at start of a scan
*
@@ -541,70 +760,20 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
}
}
-/*
- * _bt_advance_array_keys() -- Advance to next set of array elements
- *
- * Returns true if there is another set of values to consider, false if not.
- * On true result, the scankeys are initialized with the next set of values.
- */
-bool
-_bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- bool found = false;
- int i;
-
- /*
- * We must advance the last array key most quickly, since it will
- * correspond to the lowest-order index column among the available
- * qualifications. This is necessary to ensure correct ordering of output
- * when there are multiple array keys.
- */
- for (i = so->numArrayKeys - 1; i >= 0; i--)
- {
- BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
- ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key];
- int cur_elem = curArrayKey->cur_elem;
- int num_elems = curArrayKey->num_elems;
-
- if (ScanDirectionIsBackward(dir))
- {
- if (--cur_elem < 0)
- {
- cur_elem = num_elems - 1;
- found = false; /* need to advance next array key */
- }
- else
- found = true;
- }
- else
- {
- if (++cur_elem >= num_elems)
- {
- cur_elem = 0;
- found = false; /* need to advance next array key */
- }
- else
- found = true;
- }
-
- curArrayKey->cur_elem = cur_elem;
- skey->sk_argument = curArrayKey->elem_values[cur_elem];
- if (found)
- break;
- }
-
- /* advance parallel scan */
- if (scan->parallel_scan != NULL)
- _bt_parallel_advance_array_keys(scan);
-
- return found;
-}
-
/*
* _bt_mark_array_keys() -- Handle array keys during btmarkpos
*
* Save the current state of the array keys as the "mark" position.
+ *
+ * XXX The current set of array keys are not independent of the current scan
+ * position, so why treat them that way?
+ *
+ * We shouldn't even bother remembering the current array keys when btmarkpos
+ * is called. The array keys should be handled lazily instead. If and when
+ * btrestrpos is called, it can just set every array's cur_elem to the first
+ * element for the current scan direction. When _bt_advance_array_keys is
+ * reached (during the first call to _bt_checkkeys that follows), it will
+ * automatically search for the relevant array keys using caller's tuple.
*/
void
_bt_mark_array_keys(IndexScanDesc scan)
@@ -660,6 +829,749 @@ _bt_restore_array_keys(IndexScanDesc scan)
}
}
+/*
+ * Routine to determine if a continuescan=false tuple (set that way by an
+ * initial call to _bt_check_compare) might need to advance the scan's array
+ * keys.
+ *
+ * Returns true when caller passes a tuple that is < the current set of array
+ * keys for the most significant non-equal column/scan key (or > for backwards
+ * scans). This means that it cannot possibly be time to advance the array
+ * keys just yet. _bt_checkkeys caller should suppress its _bt_check_compare
+ * call, and return -- the tuple is treated as not satisfy our indexquals.
+ *
+ * Returns false when caller's tuple is >= the current array keys (or <=, in
+ * the case of backwards scans). This means that it might be time for our
+ * caller to advance the array keys to the next set.
+ *
+ * Note: advancing the array keys may be required when every attribute value
+ * from caller's tuple is equal to corresponding scan key/array datums. See
+ * comments at the start of _bt_advance_array_keys for more.
+ */
+static bool
+_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ ScanDirection dir = pstate->dir;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ bool tuple_before_array_keys = false;
+ ScanKey cur;
+ int ntupatts = BTreeTupleGetNAtts(tuple, rel),
+ ikey;
+
+ Assert(so->qual_ok);
+ Assert(so->numArrayKeys > 0);
+ Assert(so->numberOfKeys > 0);
+ Assert(!so->needPrimScan);
+
+ for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++)
+ {
+ int attnum = cur->sk_attno;
+ FmgrInfo *orderproc;
+ Datum datum;
+ bool null,
+ skrequired;
+ int32 result;
+
+ /*
+ * We only deal with equality strategy scan keys. We leave handling
+ * of inequalities up to _bt_check_compare.
+ */
+ if (cur->sk_strategy != BTEqualStrategyNumber)
+ continue;
+
+ /*
+ * Determine if this scan key is required in the current scan
+ * direction
+ */
+ skrequired = ((ScanDirectionIsForward(dir) &&
+ (cur->sk_flags & SK_BT_REQFWD)) ||
+ (ScanDirectionIsBackward(dir) &&
+ (cur->sk_flags & SK_BT_REQBKWD)));
+
+ /*
+ * Unlike _bt_advance_array_keys, we never deal with any non-required
+ * array keys. Cases where skrequiredtrigger is set to false by
+ * _bt_check_compare should never call here. We are only called after
+ * _bt_check_compare provisionally indicated that the scan should be
+ * terminated due to a _required_ scan key not being satisfied.
+ *
+ * We expect _bt_check_compare to notice and report required scan keys
+ * before non-required ones. _bt_advance_array_keys might still have
+ * to advance non-required array keys in passing for a tuple that we
+ * were called for, but _bt_advance_array_keys doesn't rely on us to
+ * give it advanced notice of that.
+ */
+ if (!skrequired)
+ break;
+
+ if (attnum > ntupatts)
+ {
+ /*
+ * When we reach a high key's truncated attribute, assume that the
+ * tuple attribute's value is >= the scan's search-type scan keys
+ */
+ break;
+ }
+
+ datum = index_getattr(tuple, attnum, itupdesc, &null);
+
+ orderproc = &so->orderProcs[attnum - 1];
+ result = _bt_compare_array_skey(cur, orderproc,
+ datum, null,
+ cur->sk_argument);
+
+ if (result != 0)
+ {
+ if (ScanDirectionIsForward(dir))
+ tuple_before_array_keys = result < 0;
+ else
+ tuple_before_array_keys = result > 0;
+
+ break;
+ }
+ }
+
+ return tuple_before_array_keys;
+}
+
+/*
+ * _bt_array_keys_remain() -- Start another primitive index scan?
+ *
+ * Returns true if there is another set of values to consider, false if not.
+ */
+bool
+_bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ /*
+ * Array keys are advanced within _bt_checkkeys when the scan reaches the
+ * leaf level (more precisely, they're advanced when the scan reaches the
+ * end of each distinct set of array elements). This process avoids
+ * repeat access to leaf pages (across multiple primitive index scans) by
+ * opportunistically advancing the scan's array keys when it allows the
+ * primitive index scan to find nearby matching tuples (or to eliminate
+ * array keys with no matching tuples from further consideration).
+ *
+ * _bt_checkkeys sets a simple flag variable that we check here. This
+ * tells us if we need to perform another primitive index scan for the
+ * now-current array keys or not. We'll unset the flag once again to
+ * acknowledge having started a new primitive scan (or we'll see that it
+ * isn't set and end the top-level scan right away).
+ *
+ * We cannot rely on _bt_first always reaching _bt_checkkeys here. There
+ * are various scenarios where that won't happen. For example, if the
+ * index is completely empty, then _bt_first won't get as far as calling
+ * _bt_readpage/_bt_checkkeys.
+ *
+ * We also don't expect _bt_checkkeys to be reached when searching for a
+ * non-existent value that happens to be higher than any existing value in
+ * the index. There won't a high key call to _bt_checkkeys if the only
+ * call to _bt_readpage is for the rightmost page, if _bt_binsrch told
+ * _bt_readpage to start at the very end of the rightmost page. There is
+ * a similar issue for backwards scans, too.
+ *
+ * We don't actually require special handling for these cases -- we don't
+ * need to be explicitly instructed to _not_ perform another primitive
+ * index scan. This is correct for all of the cases we've listed so far,
+ * which all involve primitive index scans that access pages "near the
+ * boundaries of the key space" (the leftmost page, the rightmost page, or
+ * an imaginary empty leaf root page). If _bt_checkkeys cannot be reached
+ * by a primitive index scan for one set of array keys, it follows that it
+ * also won't be reached for later set of array keys.
+ *
+ * There is one exception, that requires handling by us as a special case:
+ * the case where _bt_first's call to _bt_preprocess_keys determined that
+ * the scan keys for its would-be scan can never be satisfied. That might
+ * be true for one set of array keys, but not the next set. This is the
+ * only case where we advance the array keys for ourselves, rather than
+ * leaving it up to _bt_checkkeys.
+ */
+ if (!so->qual_ok)
+ {
+ /* _bt_first backed out; increment array keys, and try again */
+ so->needPrimScan = false;
+ if (_bt_advance_array_keys_increment(scan, dir))
+ return true;
+ }
+
+ /* Time for another primitive index scan? */
+ if (so->needPrimScan)
+ {
+ /* Begin primitive index scan */
+ so->needPrimScan = false;
+
+ if (scan->parallel_scan != NULL)
+ _bt_parallel_next_primitive_scan(scan);
+
+ return true;
+ }
+
+ /*
+ * No more primitive index scans. Just terminate the top-level scan.
+ */
+ _bt_advance_array_keys_to_end(scan, dir);
+
+ if (scan->parallel_scan != NULL)
+ _bt_parallel_done(scan);
+
+ return false;
+}
+
+/*
+ * _bt_advance_array_keys() -- Advance array elements using a tuple
+ *
+ * Returns true if all required equality-type scan keys (in particular, those
+ * that are array keys) now have exact matching values to those from tuple.
+ * Returns false when the tuple isn't an exact match in this sense.
+ *
+ * Sets pstate.continuescan for caller when we return false. When we return
+ * true it's up to caller to call _bt_check_compare to recheck the tuple. It
+ * is okay to let the second call set pstate.continuescan=false without
+ * further intervention, since we know that it can only be for a scan key that
+ * is required in one direction.
+ *
+ * When called with skrequiredtrigger, we don't expect to have to advance any
+ * non-required scan keys. We'll always set pstate.continuescan because a
+ * non-required scan key can never terminate the scan.
+ *
+ * Required array keys are always advanced to the highest element >= the
+ * corresponding tuple attribute values for its most significant non-equal
+ * column (or the next lowest set <= the tuple value during backwards scans).
+ * If we reach the end of the array keys for the current scan direction, we
+ * end the top-level index scan.
+ *
+ * _bt_tuple_before_array_skeys is responsible for determining if the current
+ * place in the scan is >= the current array keys (or <= during backward
+ * scans). This must be established first, before calling here.
+ *
+ * Note that we may sometimes need to advance the array keys in spite of the
+ * existing array keys already being an exact match for every corresponding
+ * value from caller's tuple. We fall back on "incrementally" advancing the
+ * array keys in these cases, which involve inequality strategy scan keys.
+ * For example, with a composite index on (a, b) and a qual "WHERE a IN (3,5)
+ * AND b < 42", we'll be called for both "a" arry keys (keys 3 and 5) when the
+ * scan reaches tuples where "b >= 42". Even though "a" array keys continue
+ * to have exact matches for tuples "b >= 42" (for both array key groupings),
+ * we will still advance the array for "a" via our fallback on incremental
+ * advancement each time we're called. The first time we're called (when the
+ * scan reaches a tuple >= "(3, 42)"), we advance the array key (from 3 to 5).
+ * This gives our caller the option of starting a new primitive index scan
+ * that quickly locates the start of tuples > "(5, -inf)". The second time
+ * we're called (when the scan reaches a tuple >= "(5, 42)"), we incrementally
+ * advance the keys a second time. This second call ends the top-level scan.
+ *
+ * Note also that we deal with all required equality-type scan keys here; it's
+ * not limited to array scan keys. We need to handle non-array equality cases
+ * here because they're equality constraints for the scan, in the same way
+ * that array scan keys are. We must not suppress cases where a call to
+ * _bt_check_compare sets continuescan=false for a required scan key that uses
+ * the equality strategy (only inequality-type scan keys get that treatment).
+ * We don't want to suppress the scan's termination when it's inappropriate.
+ */
+static bool
+_bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool skrequiredtrigger)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ ScanDirection dir = pstate->dir;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ ScanKey cur;
+ int ikey,
+ arrayidx = 0,
+ ntupatts = BTreeTupleGetNAtts(tuple, rel);
+ bool arrays_advanced = false,
+ arrays_done = false,
+ all_skrequired_atts_wrapped = skrequiredtrigger,
+ all_atts_equal = true;
+
+ Assert(so->numberOfKeys > 0);
+ Assert(so->numArrayKeys > 0);
+ Assert(so->qual_ok);
+
+ /*
+ * Try to advance array keys via a series of binary searches. We'll
+ * perform one search for each SK_SEARCHARRAY scan key (excluding array
+ * quals that don't use an equality type operator/strategy, which aren't
+ * backed by an array at all).
+ */
+ for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++)
+ {
+ BTArrayKeyInfo *array = NULL;
+ ScanKey skeyarray = NULL;
+ FmgrInfo *orderproc;
+ int attnum = cur->sk_attno,
+ first_elem_dir,
+ final_elem_dir,
+ set_elem;
+ Datum datum;
+ bool skrequired,
+ null;
+ int32 result;
+
+ /*
+ * We only deal with equality strategy scan keys. We leave handling
+ * of inequalities up to _bt_check_compare.
+ */
+ if (cur->sk_strategy != BTEqualStrategyNumber)
+ continue;
+
+ /*
+ * Determine if this scan key is required in the current scan
+ * direction
+ */
+ skrequired = ((ScanDirectionIsForward(dir) &&
+ (cur->sk_flags & SK_BT_REQFWD)) ||
+ (ScanDirectionIsBackward(dir) &&
+ (cur->sk_flags & SK_BT_REQBKWD)));
+
+ /*
+ * Optimization: we don't have to advance remaining non-required array
+ * keys when we already know that tuple won't be returned by the scan.
+ *
+ * Deliberately check this both here and after the binary search.
+ */
+ if (!skrequired && !all_atts_equal)
+ break;
+
+ /*
+ * We need to check required non-array scan keys (that use the equal
+ * strategy), as well as required and non-required array scan keys
+ * (also limited to those that use the equal strategy, since array
+ * inequalities degenerate into a simple comparison).
+ *
+ * Perform initial set up for this scan key. If it is backed by an
+ * array then we need to set variables describing the current position
+ * in the array.
+ *
+ * This loop iterates through the current scankeys (so->keyData, which
+ * were output by _bt_preprocess_keys earlier) and then sets input
+ * scan keys (so->arrayKeyData scan keys) to new array values. This
+ * sets things up for the next _bt_preprocess_keys call.
+ */
+ orderproc = &so->orderProcs[attnum - 1];
+ first_elem_dir = 0; /* keep compiler quiet */
+ final_elem_dir = 0; /* keep compiler quiet */
+ if (cur->sk_flags & SK_SEARCHARRAY)
+ {
+ /* Set up array comparison function */
+ Assert(arrayidx < so->numArrayKeys);
+ array = &so->arrayKeys[arrayidx++];
+ skeyarray = &so->arrayKeyData[array->scan_key];
+
+ /*
+ * It's possible that _bt_preprocess_keys determined that an
+ * individual array scan key wasn't required in so->keyData for
+ * the ongoing primitive index scan due to it being redundant or
+ * contradictory (the current array value might be redundant next
+ * to some other scan key on the same attribute). Deal with that.
+ */
+ if (unlikely(skeyarray->sk_attno != attnum))
+ {
+ bool found PG_USED_FOR_ASSERTS_ONLY = false;
+
+ for (; arrayidx < so->numArrayKeys; arrayidx++)
+ {
+ array = &so->arrayKeys[arrayidx];
+ skeyarray = &so->arrayKeyData[array->scan_key];
+ if (skeyarray->sk_attno == attnum)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ Assert(found);
+ }
+
+ if (ScanDirectionIsForward(dir))
+ {
+ first_elem_dir = 0;
+ final_elem_dir = array->num_elems - 1;
+ }
+ else
+ {
+ first_elem_dir = array->num_elems - 1;
+ final_elem_dir = 0;
+ }
+ }
+ else if (attnum > ntupatts)
+ {
+ /*
+ * Nothing needs to be done when we have a truncated attribute
+ * (possible when caller's tuple is a page high key) and a
+ * non-array scan key
+ */
+ Assert(ScanDirectionIsForward(dir));
+ continue;
+ }
+
+ /*
+ * Here we perform steps for any required scan keys after the first
+ * non-equal required scan key. The first scan key must have been set
+ * to a value > the value from the tuple back when we dealt with it
+ * (or, for a backwards scan, to a value < the value from the tuple).
+ * That needs to "cascade" to lower-order array scan keys. They must
+ * be set to the first array element for the current scan direction.
+ *
+ * We're still setting the keys to values >= the tuple here -- it just
+ * needs to work for the tuple as a whole. For example, when a tuple
+ * "(a, b) = (42, 5)" advances the array keys on "a" from 40 to 45, we
+ * must also set "b" to whatever the first array element for "b" is.
+ * It would be wrong to allow "b" to be set to a value from the tuple,
+ * since the value is actually from a different part of the key space.
+ *
+ * Also defensively do this with truncated attributes when caller's
+ * tuple is a page high key.
+ */
+ if (array && ((arrays_advanced && !all_atts_equal) ||
+ attnum > ntupatts))
+ {
+ /* Shouldn't reach this far for a non-required scan key */
+ Assert(skrequired && skrequiredtrigger && attnum > 1);
+
+ /*
+ * We set the array to the first element (if needed) here, and we
+ * don't unset all_required_atts_wrapped. This array therefore
+ * counts as a wrapped array when we go on to determine if all of
+ * the required arrays have wrapped (after this loop).
+ */
+ if (array->cur_elem != first_elem_dir)
+ {
+ array->cur_elem = first_elem_dir;
+ skeyarray->sk_argument = array->elem_values[first_elem_dir];
+ arrays_advanced = true;
+ }
+
+ continue;
+ }
+
+ /*
+ * Going to compare scan key to corresponding tuple attribute value
+ */
+ datum = index_getattr(tuple, attnum, itupdesc, &null);
+
+ if (!array)
+ {
+ if (!skrequired || !all_atts_equal)
+ continue;
+
+ /*
+ * This is a required non-array scan key that uses the equal
+ * strategy. See header comments for an explanation of why we
+ * need to do this.
+ */
+ result = _bt_compare_array_skey(cur, orderproc, datum, null,
+ cur->sk_argument);
+
+ /*
+ * _bt_tuple_before_array_skeys should always prevent us from
+ * being called when the current tuple indicates that the scan
+ * isn't yet ready to have its array keys advanced. Check with an
+ * assert.
+ */
+ Assert((ScanDirectionIsForward(dir) && result >= 0) ||
+ (ScanDirectionIsBackward(dir) && result <= 0));
+
+ if (result != 0)
+ {
+ /*
+ * tuple attribute value is > scan key value (or < scan key
+ * value in the backward scan case).
+ */
+ all_atts_equal = false;
+ break;
+ }
+
+ continue;
+ }
+
+ /*
+ * Binary search for an array key >= the tuple value, which we'll then
+ * set as our current array key (or <= the tuple value if this is a
+ * backward scan).
+ *
+ * The binary search excludes array keys that we've already processed
+ * from consideration, except with a non-required scan key's array.
+ * This is not just an optimization -- it's important for correctness.
+ * It is crucial that required array scan keys only have their array
+ * keys advanced in the current scan direction. We need to advance
+ * required array keys in lock step with the index scan.
+ *
+ * Note in particular that arrays_advanced must only be set when the
+ * array is advanced to a key >= the existing key, or <= for a
+ * backwards scan. (Though see notes about wraparound below.)
+ */
+ set_elem = _bt_binsrch_array_skey(dir, (!skrequired || arrays_advanced),
+ array, cur, orderproc, datum, null,
+ &result);
+
+ /*
+ * Maintain the state that tracks whether all attribute from the tuple
+ * are equal to the array keys that we've set as current (or existing
+ * array keys set during earlier calls here).
+ */
+ if (result != 0)
+ all_atts_equal = false;
+
+ /*
+ * Optimization: we don't have to advance remaining non-required array
+ * keys when we already know that tuple won't be returned by the scan.
+ * Quit before setting the array keys to avoid _bt_preprocess_keys.
+ *
+ * Deliberately check this both before and after the binary search.
+ */
+ if (!skrequired && !all_atts_equal)
+ break;
+
+ /*
+ * If the binary search indicates that the key space for this tuple
+ * attribute value is > the key value from the final element in the
+ * array (final for the current scan direction), we handle it by
+ * wrapping around to the first element of the array.
+ *
+ * Wrapping around simplifies advancement with a multi-column index by
+ * allowing us to treat wrapping a column as advancing the column. We
+ * preserve the invariant that a required scan key's array may only be
+ * ratcheted forward (backwards when the scan direction is backwards),
+ * while still always being able to "advance" the array at this point.
+ */
+ if (set_elem == final_elem_dir &&
+ ((ScanDirectionIsForward(dir) && result > 0) ||
+ (ScanDirectionIsBackward(dir) && result < 0)))
+ {
+ /* Perform wraparound */
+ set_elem = first_elem_dir;
+ }
+ else if (skrequired)
+ {
+ /* Won't call _bt_advance_array_keys_to_end later */
+ all_skrequired_atts_wrapped = false;
+ }
+
+ Assert(set_elem >= 0 && set_elem < array->num_elems);
+ if (array->cur_elem != set_elem)
+ {
+ array->cur_elem = set_elem;
+ skeyarray->sk_argument = array->elem_values[set_elem];
+ arrays_advanced = true;
+
+ /*
+ * We shouldn't have to advance a required array when called due
+ * to _bt_check_compare determining that a non-required array
+ * needs to be advanced. We expect _bt_check_compare to notice
+ * and report required scan keys before non-required ones.
+ */
+ Assert(skrequiredtrigger || !skrequired);
+ }
+ }
+
+ if (!skrequiredtrigger)
+ {
+ /*
+ * Failing to satisfy a non-required array scan key shouldn't ever
+ * result in terminating the (primitive) index scan
+ */
+ }
+ else if (all_skrequired_atts_wrapped)
+ {
+ /*
+ * The binary searches for each tuple's attribute value in the scan
+ * key's corresponding SK_SEARCHARRAY array all found that the tuple's
+ * value are "past the end" of the key space covered by each array
+ */
+ _bt_advance_array_keys_to_end(scan, dir);
+ arrays_done = true;
+ all_atts_equal = false; /* at least not now */
+ }
+ else if (!arrays_advanced)
+ {
+ /*
+ * We must always advance the array keys by at least one increment
+ * (except when called to advance a non-required scan key's array).
+ *
+ * We need this fallback for cases where the existing array keys and
+ * existing required equal-strategy scan keys were fully equal to the
+ * tuple. _bt_check_compare may have set continuescan=false due to an
+ * inequality terminating the scan, which we don't deal with directly.
+ * (See function's header comments for an example.)
+ */
+ if (_bt_advance_array_keys_increment(scan, dir))
+ arrays_advanced = true;
+ else
+ arrays_done = true;
+ all_atts_equal = false; /* at least not now */
+ }
+
+ /*
+ * Might make sense to recheck the high key later on in cases where we
+ * just advanced the keys (unless we were just called to advance the
+ * scan's non-required array keys)
+ */
+ if (arrays_advanced && skrequiredtrigger)
+ pstate->highkeychecked = false;
+
+ /*
+ * If we changed the array keys without exhausting all array keys then we
+ * need to preprocess our search-type scan keys once more
+ */
+ Assert(skrequiredtrigger || !arrays_done);
+ if (arrays_advanced && !arrays_done)
+ {
+ /*
+ * XXX Think about buffer-lock-held hazards here some more.
+ *
+ * In almost all interesting cases we only really need to copy over
+ * the array values (from "so->arrayKeyData" to "so->keyData"). But
+ * there are at least some cases where preprocessing scan keys to
+ * notice redundant and contradictory keys might be interesting here.
+ */
+ _bt_preprocess_keys(scan);
+ }
+
+ /* Are we now done with the top-level scan (barring a btrescan)? */
+ Assert(!so->needPrimScan);
+ if (!so->qual_ok)
+ {
+ /* Not when we have unsatisfiable quals for new array keys, ever */
+ Assert(skrequiredtrigger);
+
+ pstate->continuescan = false;
+ pstate->highkeychecked = true;
+ all_atts_equal = false; /* at least not now */
+
+ if (_bt_advance_array_keys_increment(scan, dir))
+ so->needPrimScan = true;
+ }
+ else if (!skrequiredtrigger)
+ {
+ /* Not when we failed to satisfy a non-required scan key, ever */
+ Assert(!arrays_done);
+ pstate->continuescan = true;
+ }
+ else if (arrays_done)
+ {
+ /*
+ * Yep -- this primitive scan was our last
+ */
+ Assert(!all_atts_equal);
+ pstate->continuescan = false;
+ }
+ else if (!all_atts_equal)
+ {
+ /*
+ * Not done. The top-level index scan (and primitive index scan) will
+ * continue, since the array keys advanced.
+ */
+ Assert(arrays_advanced);
+ pstate->continuescan = true;
+
+ /*
+ * Some required array keys might have wrapped around during this
+ * call, but it can't have been the most significant array scan key.
+ */
+ Assert(!all_skrequired_atts_wrapped);
+ }
+ else
+ {
+ /*
+ * Not done. A second call to _bt_check_compare must now take place.
+ * It will make the final decision on setting continuescan.
+ */
+ }
+
+ return all_atts_equal;
+}
+
+/*
+ * Advance the array keys by a single increment in the current scan direction
+ */
+static bool
+_bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool found = false;
+ int i;
+
+ Assert(!so->needPrimScan);
+
+ /*
+ * We must advance the last array key most quickly, since it will
+ * correspond to the lowest-order index column among the available
+ * qualifications. This is necessary to ensure correct ordering of output
+ * when there are multiple array keys.
+ */
+ for (i = so->numArrayKeys - 1; i >= 0; i--)
+ {
+ BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
+ ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key];
+ int cur_elem = curArrayKey->cur_elem;
+ int num_elems = curArrayKey->num_elems;
+
+ if (ScanDirectionIsBackward(dir))
+ {
+ if (--cur_elem < 0)
+ {
+ cur_elem = num_elems - 1;
+ found = false; /* need to advance next array key */
+ }
+ else
+ found = true;
+ }
+ else
+ {
+ if (++cur_elem >= num_elems)
+ {
+ cur_elem = 0;
+ found = false; /* need to advance next array key */
+ }
+ else
+ found = true;
+ }
+
+ curArrayKey->cur_elem = cur_elem;
+ skey->sk_argument = curArrayKey->elem_values[cur_elem];
+ if (found)
+ break;
+ }
+
+ return found;
+}
+
+/*
+ * Perform final steps when the "end point" is reached on the leaf level
+ * without any call to _bt_checkkeys setting *continuescan to false.
+ */
+static void
+_bt_advance_array_keys_to_end(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ Assert(so->numArrayKeys);
+ Assert(!so->needPrimScan);
+
+ for (int i = 0; i < so->numArrayKeys; i++)
+ {
+ BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
+ ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key];
+ int reset_elem;
+
+ if (ScanDirectionIsForward(dir))
+ reset_elem = curArrayKey->num_elems - 1;
+ else
+ reset_elem = 0;
+
+ if (curArrayKey->cur_elem != reset_elem)
+ {
+ curArrayKey->cur_elem = reset_elem;
+ skey->sk_argument = curArrayKey->elem_values[reset_elem];
+ }
+ }
+}
/*
* _bt_preprocess_keys() -- Preprocess scan keys
@@ -1345,38 +2257,202 @@ _bt_mark_scankey_required(ScanKey skey)
*
* Return true if so, false if not. If the tuple fails to pass the qual,
* we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
+ * this tuple, and set pstate.continuescan accordingly. See comments for
* _bt_preprocess_keys(), above, about how this is done.
*
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
+ * Forward scan callers can pass a high key tuple in the hopes of having us
+ * set pstate.continuescan to false, and avoiding an unnecessary visit to the
+ * page to the right.
+ *
+ * Forwards scan callers with equality type array scan keys are obligated to
+ * set up page state in a way that makes it possible for us to check the high
+ * key early, before we've expended too much effort on comparing tuples that
+ * cannot possibly be matches for any set of array keys. This is just an
+ * optimization.
+ *
+ * Advances the current set of array keys for SK_SEARCHARRAY scans where
+ * appropriate. These callers are required to initialize the page level high
+ * key in pstate before the first call here for the page (when the scan
+ * direction is forwards). Note that we rely on _bt_readpage calling here in
+ * page offset number order (for its scan direction). Any other order will
+ * lead to inconsistent array key state.
*
* scan: index scan descriptor (containing a search-type scankey)
+ * pstate: Page level input and output parameters
* tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
+ * finaltup: Is tuple the final one we'll be called with for this page?
*/
bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
+_bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool finaltup)
+{
+ TupleDesc tupdesc = RelationGetDescr(scan->indexRelation);
+ int natts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool res;
+ bool skrequiredtrigger;
+
+ Assert(so->qual_ok);
+ Assert(pstate->continuescan);
+ Assert(!so->needPrimScan);
+
+ res = _bt_check_compare(pstate->dir, so->keyData, so->numberOfKeys,
+ tuple, natts, tupdesc,
+ &pstate->continuescan, &skrequiredtrigger);
+
+ /*
+ * Only one _bt_check_compare call is required in the common case where
+ * there are no equality-type array scan keys.
+ *
+ * When there are array scan keys then we can still accept the first
+ * answer we get from _bt_check_compare when continuescan wasn't unset.
+ */
+ if (!so->numArrayKeys || pstate->continuescan)
+ return res;
+
+ /*
+ * _bt_check_compare set continuescan=false in the presence of equality
+ * type array keys. It's possible that we haven't reached the start of
+ * the array keys just yet. It's also possible that we need to advance
+ * the array keys now. (Or perhaps we really do need to terminate the
+ * top-level scan.)
+ */
+ pstate->continuescan = true; /* new initial assumption */
+
+ if (skrequiredtrigger && _bt_tuple_before_array_skeys(scan, pstate, tuple))
+ {
+ /*
+ * Tuple is still < the current array scan key values (as well as
+ * other equality type scan keys) if this is a forward scan.
+ * (Backwards scans reach here with a tuple > equality constraints.)
+ * We must now consider how to proceed with the ongoing primitive
+ * index scan.
+ *
+ * Should _bt_readpage continue with this page for now, in the hope of
+ * finding tuples whose key space is covered by the current array keys
+ * before too long? Or, should it give up and start a new primitive
+ * index scan instead?
+ *
+ * Our policy is to terminate the primitive index scan at the end of
+ * the current page if the current (most recently advanced) array keys
+ * don't cover the final tuple from the page. This policy is fairly
+ * conservative.
+ *
+ * Note: In some cases we're effectively speculating that the next
+ * sibling leaf page will have tuples that are covered by the key
+ * space of our array keys (the current set or some nearby set), based
+ * on a cue from the current page's final tuple. There is at least a
+ * non-zero risk of wasting a page access -- we could gamble and lose.
+ * The details of all this are handled within _bt_advance_array_keys.
+ */
+ if (finaltup || (!pstate->highkeychecked && pstate->highkey &&
+ _bt_tuple_before_array_skeys(scan, pstate,
+ pstate->highkey)))
+ {
+ /*
+ * This is the final tuple (the high key for a forward scan, or
+ * the non-pivot tuple at the first offset number for a backward
+ * scan), and its still before the array keys. Give up now by
+ * starting a new primitive index scan.
+ *
+ * Have _bt_readpage stop the scan of this page immediately,
+ * starting a new primitive index scan. Another primitive index
+ * scan must be required (if the top-level scan could be
+ * terminated then we'd have done so by now).
+ *
+ * Note: _bt_readpage stashes the page high key, enabling us to
+ * make this check early in the case of forward scans. We thereby
+ * avoid scanning very many extra tuples on the page. This is
+ * purely an optimization -- it doesn't affect the behavior of the
+ * scan (not in a way that can be observed outside of
+ * _bt_readpage, at least).
+ */
+ pstate->continuescan = false;
+ so->needPrimScan = true;
+ }
+ else if (!finaltup && pstate->highkey)
+ {
+ /*
+ * Remember that the high key has been checked with this
+ * particular set of array keys.
+ *
+ * It might make sense to check the same high key again at some
+ * point during the ongoing _bt_readpage-wise scan of this page.
+ * But it is definitely wasteful to repeat the same high key check
+ * before the array keys are advanced by some later tuple.
+ */
+ pstate->highkeychecked = true;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual
+ */
+ return false;
+ }
+
+ /*
+ * Caller's tuple is >= the current set of array keys and other equality
+ * constraint scan keys (or <= if this is a backwards scans).
+ *
+ * It might be time to advance the array keys to the next set. Try doing
+ * that now, while determining in passing if the tuple matches the newly
+ * advanced set of array keys (if we've any left).
+ *
+ * This call will also set continuescan for us (or tells us to perform
+ * another _bt_check_compare call, which then sets continuescan for us).
+ */
+ if (!_bt_advance_array_keys(scan, pstate, tuple, skrequiredtrigger))
+ {
+ /*
+ * Tuple doesn't match any later array keys, either (for one or more
+ * array type scan keys marked as required). Give up on this tuple
+ * being a match. (Call may also have terminated the primitive scan,
+ * or the top-level scan.)
+ */
+ return false;
+ }
+
+ /*
+ * Advanced array keys to values that are exact matches for corresponding
+ * attribute values from the tuple.
+ *
+ * It's fairly likely that the tuple satisfies all index scan conditions
+ * at this point, but we need confirmation of that. We also need to give
+ * _bt_check_compare a real opportunity to end the top-level index scan by
+ * setting continuescan=false. (_bt_advance_array_keys cannot deal with
+ * inequality strategy scan keys; we need _bt_check_compare for those.)
+ */
+ return _bt_check_compare(pstate->dir, so->keyData, so->numberOfKeys,
+ tuple, natts, tupdesc,
+ &pstate->continuescan, &skrequiredtrigger);
+}
+
+/*
+ * Test whether an indextuple satisfies current scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction to
+ * pass the qual with the current set of array keys.
+ *
+ * This is a subroutine for _bt_checkkeys. It is written with the assumption
+ * that reaching the end of each distinct set of array keys terminates the
+ * ongoing primitive index scan. It is up to our caller (that has more
+ * context than we have available here) to override that initial determination
+ * when it makes more sense to advance the array keys and continue with
+ * further tuples from the same leaf page.
+ */
+static bool
+_bt_check_compare(ScanDirection dir, ScanKey keyData, int keysz,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ bool *continuescan, bool *skrequiredtrigger)
{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
int ikey;
ScanKey key;
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
*continuescan = true; /* default assumption */
+ *skrequiredtrigger = true; /* default assumption */
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ for (key = keyData, ikey = 0; ikey < keysz; key++, ikey++)
{
Datum datum;
bool isNull;
@@ -1497,6 +2573,10 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* qual fails, it is critical that equality quals be used for the
* initial positioning in _bt_first() when they are available. See
* comments in _bt_first().
+ *
+ * Scans with equality-type array scan keys run into a similar
+ * problem whenever they advance the array keys. Our caller uses
+ * _bt_tuple_before_array_skeys to avoid the problem there.
*/
if ((key->sk_flags & SK_BT_REQFWD) &&
ScanDirectionIsForward(dir))
@@ -1505,6 +2585,14 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
ScanDirectionIsBackward(dir))
*continuescan = false;
+ if ((key->sk_flags & SK_SEARCHARRAY) &&
+ key->sk_strategy == BTEqualStrategyNumber)
+ {
+ if (*continuescan)
+ *skrequiredtrigger = false;
+ *continuescan = false;
+ }
+
/*
* In any case, this indextuple doesn't match the qual.
*/
@@ -1523,7 +2611,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* it's not possible for any future tuples in the current scan direction
* to pass the qual.
*
- * This is a subroutine for _bt_checkkeys, which see for more info.
+ * This is a subroutine for _bt_check_compare/_bt_checkkeys_compare.
*/
static bool
_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 6a93d767a..f04ca1ee9 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -106,8 +106,7 @@ static List *build_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexOptInfo *index, IndexClauseSet *clauses,
bool useful_predicate,
ScanTypeControl scantype,
- bool *skip_nonnative_saop,
- bool *skip_lower_saop);
+ bool *skip_nonnative_saop);
static List *build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel,
List *clauses, List *other_clauses);
static List *generate_bitmap_or_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -706,8 +705,6 @@ eclass_already_used(EquivalenceClass *parent_ec, Relids oldrelids,
* index AM supports them natively, we should just include them in simple
* index paths. If not, we should exclude them while building simple index
* paths, and then make a separate attempt to include them in bitmap paths.
- * Furthermore, we should consider excluding lower-order ScalarArrayOpExpr
- * quals so as to create ordered paths.
*/
static void
get_index_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -716,37 +713,17 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
{
List *indexpaths;
bool skip_nonnative_saop = false;
- bool skip_lower_saop = false;
ListCell *lc;
/*
* Build simple index paths using the clauses. Allow ScalarArrayOpExpr
- * clauses only if the index AM supports them natively, and skip any such
- * clauses for index columns after the first (so that we produce ordered
- * paths if possible).
+ * clauses only if the index AM supports them natively.
*/
indexpaths = build_index_paths(root, rel,
index, clauses,
index->predOK,
ST_ANYSCAN,
- &skip_nonnative_saop,
- &skip_lower_saop);
-
- /*
- * If we skipped any lower-order ScalarArrayOpExprs on an index with an AM
- * that supports them, then try again including those clauses. This will
- * produce paths with more selectivity but no ordering.
- */
- if (skip_lower_saop)
- {
- indexpaths = list_concat(indexpaths,
- build_index_paths(root, rel,
- index, clauses,
- index->predOK,
- ST_ANYSCAN,
- &skip_nonnative_saop,
- NULL));
- }
+ &skip_nonnative_saop);
/*
* Submit all the ones that can form plain IndexScan plans to add_path. (A
@@ -784,7 +761,6 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
index, clauses,
false,
ST_BITMAPSCAN,
- NULL,
NULL);
*bitindexpaths = list_concat(*bitindexpaths, indexpaths);
}
@@ -817,27 +793,19 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
* to true if we found any such clauses (caller must initialize the variable
* to false). If it's NULL, we do not ignore ScalarArrayOpExpr clauses.
*
- * If skip_lower_saop is non-NULL, we ignore ScalarArrayOpExpr clauses for
- * non-first index columns, and we set *skip_lower_saop to true if we found
- * any such clauses (caller must initialize the variable to false). If it's
- * NULL, we do not ignore non-first ScalarArrayOpExpr clauses, but they will
- * result in considering the scan's output to be unordered.
- *
* 'rel' is the index's heap relation
* 'index' is the index for which we want to generate paths
* 'clauses' is the collection of indexable clauses (IndexClause nodes)
* 'useful_predicate' indicates whether the index has a useful predicate
* 'scantype' indicates whether we need plain or bitmap scan support
* 'skip_nonnative_saop' indicates whether to accept SAOP if index AM doesn't
- * 'skip_lower_saop' indicates whether to accept non-first-column SAOP
*/
static List *
build_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexOptInfo *index, IndexClauseSet *clauses,
bool useful_predicate,
ScanTypeControl scantype,
- bool *skip_nonnative_saop,
- bool *skip_lower_saop)
+ bool *skip_nonnative_saop)
{
List *result = NIL;
IndexPath *ipath;
@@ -848,7 +816,6 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
List *orderbyclausecols;
List *index_pathkeys;
List *useful_pathkeys;
- bool found_lower_saop_clause;
bool pathkeys_possibly_useful;
bool index_is_ordered;
bool index_only_scan;
@@ -880,19 +847,11 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
* on by btree and possibly other places.) The list can be empty, if the
* index AM allows that.
*
- * found_lower_saop_clause is set true if we accept a ScalarArrayOpExpr
- * index clause for a non-first index column. This prevents us from
- * assuming that the scan result is ordered. (Actually, the result is
- * still ordered if there are equality constraints for all earlier
- * columns, but it seems too expensive and non-modular for this code to be
- * aware of that refinement.)
- *
* We also build a Relids set showing which outer rels are required by the
* selected clauses. Any lateral_relids are included in that, but not
* otherwise accounted for.
*/
index_clauses = NIL;
- found_lower_saop_clause = false;
outer_relids = bms_copy(rel->lateral_relids);
for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++)
{
@@ -917,16 +876,6 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
/* Caller had better intend this only for bitmap scan */
Assert(scantype == ST_BITMAPSCAN);
}
- if (indexcol > 0)
- {
- if (skip_lower_saop)
- {
- /* Caller doesn't want to lose index ordering */
- *skip_lower_saop = true;
- continue;
- }
- found_lower_saop_clause = true;
- }
}
/* OK to include this clause */
@@ -956,11 +905,9 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
/*
* 2. Compute pathkeys describing index's ordering, if any, then see how
* many of them are actually useful for this query. This is not relevant
- * if we are only trying to build bitmap indexscans, nor if we have to
- * assume the scan is unordered.
+ * if we are only trying to build bitmap indexscans.
*/
pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN &&
- !found_lower_saop_clause &&
has_useful_pathkeys(root, rel));
index_is_ordered = (index->sortopfamily != NULL);
if (index_is_ordered && pathkeys_possibly_useful)
@@ -1212,7 +1159,6 @@ build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel,
index, &clauseset,
useful_predicate,
ST_BITMAPSCAN,
- NULL,
NULL);
result = list_concat(result, indexpaths);
}
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076..c796b53a6 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6444,8 +6444,6 @@ genericcostestimate(PlannerInfo *root,
double numIndexTuples;
double spc_random_page_cost;
double num_sa_scans;
- double num_outer_scans;
- double num_scans;
double qual_op_cost;
double qual_arg_cost;
List *selectivityQuals;
@@ -6460,7 +6458,7 @@ genericcostestimate(PlannerInfo *root,
/*
* Check for ScalarArrayOpExpr index quals, and estimate the number of
- * index scans that will be performed.
+ * primitive index scans that will be performed for caller
*/
num_sa_scans = 1;
foreach(l, indexQuals)
@@ -6490,19 +6488,8 @@ genericcostestimate(PlannerInfo *root,
*/
numIndexTuples = costs->numIndexTuples;
if (numIndexTuples <= 0.0)
- {
numIndexTuples = indexSelectivity * index->rel->tuples;
- /*
- * The above calculation counts all the tuples visited across all
- * scans induced by ScalarArrayOpExpr nodes. We want to consider the
- * average per-indexscan number, so adjust. This is a handy place to
- * round to integer, too. (If caller supplied tuple estimate, it's
- * responsible for handling these considerations.)
- */
- numIndexTuples = rint(numIndexTuples / num_sa_scans);
- }
-
/*
* We can bound the number of tuples by the index size in any case. Also,
* always estimate at least one tuple is touched, even when
@@ -6540,27 +6527,31 @@ genericcostestimate(PlannerInfo *root,
*
* The above calculations are all per-index-scan. However, if we are in a
* nestloop inner scan, we can expect the scan to be repeated (with
- * different search keys) for each row of the outer relation. Likewise,
- * ScalarArrayOpExpr quals result in multiple index scans. This creates
- * the potential for cache effects to reduce the number of disk page
- * fetches needed. We want to estimate the average per-scan I/O cost in
- * the presence of caching.
+ * different search keys) for each row of the outer relation. This
+ * creates the potential for cache effects to reduce the number of disk
+ * page fetches needed. We want to estimate the average per-scan I/O cost
+ * in the presence of caching.
*
* We use the Mackert-Lohman formula (see costsize.c for details) to
* estimate the total number of page fetches that occur. While this
* wasn't what it was designed for, it seems a reasonable model anyway.
* Note that we are counting pages not tuples anymore, so we take N = T =
* index size, as if there were one "tuple" per page.
+ *
+ * Note: we assume that there will be no repeat index page fetches across
+ * ScalarArrayOpExpr primitive scans from the same logical index scan.
+ * This is guaranteed to be true for btree indexes, but is very optimistic
+ * with index AMs that cannot natively execute ScalarArrayOpExpr quals.
+ * However, these same index AMs also accept our default pessimistic
+ * approach to counting num_sa_scans (btree caller caps this), so we don't
+ * expect the final indexTotalCost to be wildly over-optimistic.
*/
- num_outer_scans = loop_count;
- num_scans = num_sa_scans * num_outer_scans;
-
- if (num_scans > 1)
+ if (loop_count > 1)
{
double pages_fetched;
/* total page fetches ignoring cache effects */
- pages_fetched = numIndexPages * num_scans;
+ pages_fetched = numIndexPages * loop_count;
/* use Mackert and Lohman formula to adjust for cache effects */
pages_fetched = index_pages_fetched(pages_fetched,
@@ -6570,11 +6561,9 @@ genericcostestimate(PlannerInfo *root,
/*
* Now compute the total disk access cost, and then report a pro-rated
- * share for each outer scan. (Don't pro-rate for ScalarArrayOpExpr,
- * since that's internal to the indexscan.)
+ * share for each outer scan
*/
- indexTotalCost = (pages_fetched * spc_random_page_cost)
- / num_outer_scans;
+ indexTotalCost = (pages_fetched * spc_random_page_cost) / loop_count;
}
else
{
@@ -6590,10 +6579,8 @@ genericcostestimate(PlannerInfo *root,
* evaluated once at the start of the scan to reduce them to runtime keys
* to pass to the index AM (see nodeIndexscan.c). We model the per-tuple
* CPU costs as cpu_index_tuple_cost plus one cpu_operator_cost per
- * indexqual operator. Because we have numIndexTuples as a per-scan
- * number, we have to multiply by num_sa_scans to get the correct result
- * for ScalarArrayOpExpr cases. Similarly add in costs for any index
- * ORDER BY expressions.
+ * indexqual operator. Similarly add in costs for any index ORDER BY
+ * expressions.
*
* Note: this neglects the possible costs of rechecking lossy operators.
* Detecting that that might be needed seems more expensive than it's
@@ -6606,7 +6593,7 @@ genericcostestimate(PlannerInfo *root,
indexStartupCost = qual_arg_cost;
indexTotalCost += qual_arg_cost;
- indexTotalCost += numIndexTuples * num_sa_scans * (cpu_index_tuple_cost + qual_op_cost);
+ indexTotalCost += numIndexTuples * (cpu_index_tuple_cost + qual_op_cost);
/*
* Generic assumption about index correlation: there isn't any.
@@ -6684,7 +6671,6 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
bool eqQualHere;
bool found_saop;
bool found_is_null_op;
- double num_sa_scans;
ListCell *lc;
/*
@@ -6699,17 +6685,12 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
*
* For a RowCompareExpr, we consider only the first column, just as
* rowcomparesel() does.
- *
- * If there's a ScalarArrayOpExpr in the quals, we'll actually perform N
- * index scans not one, but the ScalarArrayOpExpr's operator can be
- * considered to act the same as it normally does.
*/
indexBoundQuals = NIL;
indexcol = 0;
eqQualHere = false;
found_saop = false;
found_is_null_op = false;
- num_sa_scans = 1;
foreach(lc, path->indexclauses)
{
IndexClause *iclause = lfirst_node(IndexClause, lc);
@@ -6749,14 +6730,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
else if (IsA(clause, ScalarArrayOpExpr))
{
ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause;
- Node *other_operand = (Node *) lsecond(saop->args);
- int alength = estimate_array_length(other_operand);
clause_op = saop->opno;
found_saop = true;
- /* count number of SA scans induced by indexBoundQuals only */
- if (alength > 1)
- num_sa_scans *= alength;
}
else if (IsA(clause, NullTest))
{
@@ -6805,9 +6781,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
Selectivity btreeSelectivity;
/*
- * If the index is partial, AND the index predicate with the
- * index-bound quals to produce a more accurate idea of the number of
- * rows covered by the bound conditions.
+ * AND the index predicate with the index-bound quals to produce a
+ * more accurate idea of the number of rows covered by the bound
+ * conditions
*/
selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
@@ -6816,13 +6792,6 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
JOIN_INNER,
NULL);
numIndexTuples = btreeSelectivity * index->rel->tuples;
-
- /*
- * As in genericcostestimate(), we have to adjust for any
- * ScalarArrayOpExpr quals included in indexBoundQuals, and then round
- * to integer.
- */
- numIndexTuples = rint(numIndexTuples / num_sa_scans);
}
/*
@@ -6832,6 +6801,43 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
genericcostestimate(root, path, loop_count, &costs);
+ /*
+ * Now compensate for btree's ability to efficiently execute scans with
+ * SAOP clauses.
+ *
+ * btree automatically combines individual ScalarArrayOpExpr primitive
+ * index scans whenever the tuples covered by the next set of array keys
+ * are close to tuples covered by the current set. This makes the final
+ * number of descents particularly difficult to estimate. However, btree
+ * scans never visit any single leaf page more than once. That puts a
+ * natural floor under the worst case number of descents.
+ *
+ * It's particularly important that we not wildly overestimate the number
+ * of descents needed for a clause list with several SAOPs -- the costs
+ * really aren't multiplicative in the way genericcostestimate expects. In
+ * general, most distinct combinations of SAOP keys will tend to not find
+ * any matching tuples. Furthermore, btree scans search for the next set
+ * of array keys using the next tuple in line, and so won't even need a
+ * direct comparison to eliminate most non-matching sets of array keys.
+ *
+ * Clamp the number of descents to the estimated number of leaf page
+ * visits. This is still fairly pessimistic, but tends to result in more
+ * accurate costing of scans with several SAOP clauses -- especially when
+ * each array has more than a few elements. The cost of adding additional
+ * array constants to a low-order SAOP column should saturate past a
+ * certain point (except where selectivity estimates continue to shift).
+ *
+ * Also clamp the number of descents to 1/3 the number of index pages.
+ * This avoids implausibly high estimates with low selectivity paths,
+ * where scans frequently require no more than one or two descents.
+ */
+ if (costs.num_sa_scans > 1)
+ {
+ costs.num_sa_scans = Min(costs.num_sa_scans, costs.numIndexPages);
+ costs.num_sa_scans = Min(costs.num_sa_scans, index->pages / 3);
+ costs.num_sa_scans = Max(costs.num_sa_scans, 1);
+ }
+
/*
* Add a CPU-cost component to represent the costs of initial btree
* descent. We don't charge any I/O cost for touching upper btree levels,
@@ -6839,9 +6845,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* comparisons to descend a btree of N leaf tuples. We charge one
* cpu_operator_cost per comparison.
*
- * If there are ScalarArrayOpExprs, charge this once per SA scan. The
- * ones after the first one are not startup cost so far as the overall
- * plan is concerned, so add them only to "total" cost.
+ * If there are ScalarArrayOpExprs, charge this once per estimated
+ * primitive SA scan. The ones after the first one are not startup cost
+ * so far as the overall plan goes, so just add them to "total" cost.
*/
if (index->tuples > 1) /* avoid computing log(0) */
{
@@ -6858,7 +6864,8 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* in cases where only a single leaf page is expected to be visited. This
* cost is somewhat arbitrarily set at 50x cpu_operator_cost per page
* touched. The number of such pages is btree tree height plus one (ie,
- * we charge for the leaf page too). As above, charge once per SA scan.
+ * we charge for the leaf page too). As above, charge once per estimated
+ * primitive SA scan.
*/
descentCost = (index->tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER * cpu_operator_cost;
costs.indexStartupCost += descentCost;
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index acfd9d1f4..0dde21ca2 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1910,7 +1910,7 @@ SELECT count(*) FROM dupindexcols
(1 row)
--
--- Check ordering of =ANY indexqual results (bug in 9.2.0)
+-- Check that index scans with =ANY indexquals return rows in index order
--
explain (costs off)
SELECT unique1 FROM tenk1
@@ -1936,12 +1936,11 @@ explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
- QUERY PLAN
--------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------------------
Index Only Scan using tenk1_thous_tenthous on tenk1
- Index Cond: (thousand < 2)
- Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
-(3 rows)
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -1952,18 +1951,35 @@ ORDER BY thousand;
1 | 1001
(2 rows)
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand desc, tenthous desc;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Only Scan Backward using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand desc, tenthous desc;
+ thousand | tenthous
+----------+----------
+ 1 | 1001
+ 0 | 3000
+(2 rows)
+
SET enable_indexonlyscan = OFF;
explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
- QUERY PLAN
---------------------------------------------------------------------------------------
- Sort
- Sort Key: thousand
- -> Index Scan using tenk1_thous_tenthous on tenk1
- Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
-(4 rows)
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -1974,6 +1990,25 @@ ORDER BY thousand;
1 | 1001
(2 rows)
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand desc, tenthous desc;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan Backward using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand desc, tenthous desc;
+ thousand | tenthous
+----------+----------
+ 1 | 1001
+ 0 | 3000
+(2 rows)
+
RESET enable_indexonlyscan;
--
-- Check elimination of constant-NULL subexpressions
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 9b8638f28..20b69ff87 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -7797,10 +7797,9 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 >= any (array[1,5]);
Merge Cond: (j1.id1 = j2.id1)
Join Filter: (j2.id2 = j1.id2)
-> Index Scan using j1_id1_idx on j1
- -> Index Only Scan using j2_pkey on j2
+ -> Index Scan using j2_id1_idx on j2
Index Cond: (id1 >= ANY ('{1,5}'::integer[]))
- Filter: ((id1 % 1000) = 1)
-(7 rows)
+(6 rows)
select * from j1
inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index d49ce9f30..4f19fac54 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -753,7 +753,7 @@ SELECT count(*) FROM dupindexcols
WHERE f1 BETWEEN 'WA' AND 'ZZZ' and id < 1000 and f1 ~<~ 'YX';
--
--- Check ordering of =ANY indexqual results (bug in 9.2.0)
+-- Check that index scans with =ANY indexquals return rows in index order
--
explain (costs off)
@@ -774,6 +774,15 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand desc, tenthous desc;
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand desc, tenthous desc;
+
SET enable_indexonlyscan = OFF;
explain (costs off)
@@ -785,6 +794,15 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand desc, tenthous desc;
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand desc, tenthous desc;
+
RESET enable_indexonlyscan;
--
--
2.40.1
On Sun, Sep 17, 2023 at 4:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v2, which makes all array key advancement take place using
the "next index tuple" approach (using binary searches to find array
keys using index tuple values).
Attached is v3, which fixes bitrot caused by today's bugfix commit 714780dc.
No notable changes here compared to v2.
--
Peter Geoghegan
Attachments:
v3-0001-Enhance-nbtree-ScalarArrayOp-execution.patchapplication/x-patch; name=v3-0001-Enhance-nbtree-ScalarArrayOp-execution.patchDownload
From 2cff1dadb7903d49a2338b64b27178fa0bc51456 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 17 Jun 2023 17:03:36 -0700
Subject: [PATCH v3] Enhance nbtree ScalarArrayOp execution.
Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively. This works by pushing additional context about the arrays
down into the nbtree index AM, as index quals. This information enabled
nbtree to execute multiple primitive index scans as part of an index
scan executor node that was treated as one continuous index scan.
The motivation behind this earlier work was enabling index-only scans
with ScalarArrayOpExpr clauses (SAOP quals are traditionally executed
via BitmapOr nodes, which is largely index-AM-agnostic, but always
requires heap access). The general idea of giving the index AM this
additional context can be pushed a lot further, though.
Teach nbtree SAOP index scans to dynamically advance array scan keys
using information about the characteristics of the index, determined at
runtime. The array key state machine advances the current array keys
using the next index tuple in line to be scanned, at the point where the
scan reaches the end of the last set of array keys. This approach is
far more flexible, and can be far more efficient. Cases that previously
required hundreds (even thousands) of primitive index scans now require
as few as one single primitive index scan.
Also remove all restrictions on generating path keys for nbtree index
scans that happen to have ScalarArrayOpExpr quals. Bugfix commit
807a40c5 taught the planner to avoid generating unsafe path keys: path
keys on a multicolumn index path, with a SAOP clause on any attribute
beyond the first/most significant attribute. These cases are now safe.
Now nbtree index scans with an inequality clause on a high order column
and a SAOP clause on a lower order column are executed as one single
primitive index scan, since that is the most efficient way to do it.
Non-required equality type SAOP quals are executed by nbtree using
almost the same approach used for required equality type SAOP quals.
nbtree is now strictly guaranteed to avoid all repeat accesses to any
individual leaf page, even in cases with inequalities on high order
columns (except when the scan direction changes, or the scan restarts).
We now have strong guarantees about the worst case, which is very useful
when costing index scans with SAOP clauses. The cost profile of index
paths with multiple SAOP clauses is now a lot closer to other cases;
more selective index scans will now generally have lower costs than less
selective index scans. The added cost from repeatedly descending the
index still matters, but it can never dominate.
An important goal of this work is to remove all ScalarArrayOpExpr clause
special cases from the planner -- ScalarArrayOpExpr clauses can now be
thought of a generalization of simple equality clauses (except when
costing index scans, perhaps). The planner no longer needs to generate
alternative index paths with filter quals/qpquals. We assume that true
SAOP index quals are strictly better than filter/qpquals, since the work
in nbtree guarantees that they'll be at least slightly faster.
Many of the queries sped up by the work from this commit don't directly
benefit from the nbtree/executor enhancements. They benefit indirectly.
The planner no longer shows any restraint around making SAOP clauses
into true nbtree index quals, which tends to result in significant
savings on heap page accesses. In general we never need visibility
checks to evaluate true index quals, whereas filter quals often need to
perform extra heap accesses, just to eliminate non-matching tuples
(expression evaluation is only safe with known visible tuples).
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
---
src/include/access/nbtree.h | 39 +-
src/backend/access/nbtree/nbtree.c | 65 +-
src/backend/access/nbtree/nbtsearch.c | 62 +-
src/backend/access/nbtree/nbtutils.c | 1367 ++++++++++++++++++--
src/backend/optimizer/path/indxpath.c | 64 +-
src/backend/utils/adt/selfuncs.c | 123 +-
doc/src/sgml/monitoring.sgml | 13 +
src/test/regress/expected/create_index.out | 61 +-
src/test/regress/expected/join.out | 5 +-
src/test/regress/sql/create_index.sql | 20 +-
10 files changed, 1506 insertions(+), 313 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 6345e16d7..33db9b648 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1043,13 +1043,13 @@ typedef struct BTScanOpaqueData
/* workspace for SK_SEARCHARRAY support */
ScanKey arrayKeyData; /* modified copy of scan->keyData */
- bool arraysStarted; /* Started array keys, but have yet to "reach
- * past the end" of all arrays? */
int numArrayKeys; /* number of equality-type array keys (-1 if
* there are any unsatisfiable array keys) */
- int arrayKeyCount; /* count indicating number of array scan keys
- * processed */
+ bool needPrimScan; /* Perform another primitive scan? */
BTArrayKeyInfo *arrayKeys; /* info about each equality-type array key */
+ FmgrInfo *orderProcs; /* ORDER procs for equality constraint keys */
+ int numPrimScans; /* Running tally of # primitive index scans
+ * (used to coordinate parallel workers) */
MemoryContext arrayContext; /* scan-lifespan context for array data */
/* info about killed items if any (killedItems is NULL if never used) */
@@ -1080,6 +1080,29 @@ typedef struct BTScanOpaqueData
typedef BTScanOpaqueData *BTScanOpaque;
+/*
+ * _bt_readpage state used across _bt_checkkeys calls for a page
+ *
+ * When _bt_readpage is called during a forward scan that has one or more
+ * equality-type SK_SEARCHARRAY scan keys, it has an extra responsibility: to
+ * set up information about the page high key. This must happen before the
+ * first call to _bt_checkkeys. _bt_checkkeys uses this information to manage
+ * advancement of the scan's array keys.
+ */
+typedef struct BTReadPageState
+{
+ /* Input parameters, set by _bt_readpage */
+ ScanDirection dir; /* current scan direction */
+ IndexTuple highkey; /* page high key, set by forward scans */
+
+ /* Output parameters, set by _bt_checkkeys */
+ bool continuescan; /* Terminate ongoing (primitive) index scan? */
+
+ /* Private _bt_checkkeys-managed state */
+ bool highkeychecked; /* high key checked against current
+ * SK_SEARCHARRAY array keys? */
+} BTReadPageState;
+
/*
* We use some private sk_flags bits in preprocessed scan keys. We're allowed
* to use bits 16-31 (see skey.h). The uppermost bits are copied from the
@@ -1157,7 +1180,7 @@ extern bool btcanreturn(Relation index, int attno);
extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno);
extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
extern void _bt_parallel_done(IndexScanDesc scan);
-extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
+extern void _bt_parallel_next_primitive_scan(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
@@ -1250,12 +1273,12 @@ extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
+extern bool _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool finaltup);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 6c5b5c69c..27fbb86d0 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -48,8 +48,8 @@
* BTPARALLEL_IDLE indicates that no backend is currently advancing the scan
* to a new page; some process can start doing that.
*
- * BTPARALLEL_DONE indicates that the scan is complete (including error exit).
- * We reach this state once for every distinct combination of array keys.
+ * BTPARALLEL_DONE indicates that the primitive index scan is complete
+ * (including error exit). Reached once per primitive index scan.
*/
typedef enum
{
@@ -69,8 +69,8 @@ typedef struct BTParallelScanDescData
BTPS_State btps_pageStatus; /* indicates whether next page is
* available for scan. see above for
* possible states of parallel scan. */
- int btps_arrayKeyCount; /* count indicating number of array scan
- * keys processed by parallel scan */
+ int btps_numPrimScans; /* count indicating number of primitive
+ * index scans (used with array keys) */
slock_t btps_mutex; /* protects above variables */
ConditionVariable btps_cv; /* used to synchronize parallel scan */
} BTParallelScanDescData;
@@ -276,7 +276,7 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
if (res)
break;
/* ... otherwise see if we have more array keys to deal with */
- } while (so->numArrayKeys && _bt_advance_array_keys(scan, dir));
+ } while (so->numArrayKeys && _bt_array_keys_remain(scan, dir));
return res;
}
@@ -334,7 +334,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
}
}
/* Now see if we have more array keys to deal with */
- } while (so->numArrayKeys && _bt_advance_array_keys(scan, ForwardScanDirection));
+ } while (so->numArrayKeys && _bt_array_keys_remain(scan, ForwardScanDirection));
return ntids;
}
@@ -364,9 +364,10 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->keyData = NULL;
so->arrayKeyData = NULL; /* assume no array keys for now */
- so->arraysStarted = false;
so->numArrayKeys = 0;
+ so->needPrimScan = false;
so->arrayKeys = NULL;
+ so->orderProcs = NULL;
so->arrayContext = NULL;
so->killedItems = NULL; /* until needed */
@@ -406,7 +407,8 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
}
so->markItemIndex = -1;
- so->arrayKeyCount = 0;
+ so->needPrimScan = false;
+ so->numPrimScans = 0;
BTScanPosUnpinIfPinned(so->markPos);
BTScanPosInvalidate(so->markPos);
@@ -587,7 +589,7 @@ btinitparallelscan(void *target)
SpinLockInit(&bt_target->btps_mutex);
bt_target->btps_scanPage = InvalidBlockNumber;
bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- bt_target->btps_arrayKeyCount = 0;
+ bt_target->btps_numPrimScans = 0;
ConditionVariableInit(&bt_target->btps_cv);
}
@@ -613,7 +615,7 @@ btparallelrescan(IndexScanDesc scan)
SpinLockAcquire(&btscan->btps_mutex);
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- btscan->btps_arrayKeyCount = 0;
+ btscan->btps_numPrimScans = 0;
SpinLockRelease(&btscan->btps_mutex);
}
@@ -624,7 +626,17 @@ btparallelrescan(IndexScanDesc scan)
*
* The return value is true if we successfully seized the scan and false
* if we did not. The latter case occurs if no pages remain for the current
- * set of scankeys.
+ * primitive index scan.
+ *
+ * When array scan keys are in use, each worker process independently advances
+ * its array keys. It's crucial that each worker process never be allowed to
+ * scan a page from before the current scan position.
+ *
+ * XXX This particular aspect of the patch is still at the proof of concept
+ * stage. Having this much available for review at least suggests that it'll
+ * be feasible to port the existing parallel scan array scan key stuff over to
+ * using a primitive index scan counter (as opposed to an array key counter)
+ * the top-level scan. I have yet to really put this code through its paces.
*
* If the return value is true, *pageno returns the next or current page
* of the scan (depending on the scan direction). An invalid block number
@@ -655,16 +667,17 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
SpinLockAcquire(&btscan->btps_mutex);
pageStatus = btscan->btps_pageStatus;
- if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+ if (so->numPrimScans < btscan->btps_numPrimScans)
{
- /* Parallel scan has already advanced to a new set of scankeys. */
+ /* Top-level scan already moved on to next primitive index scan */
status = false;
}
else if (pageStatus == BTPARALLEL_DONE)
{
/*
- * We're done with this set of scankeys. This may be the end, or
- * there could be more sets to try.
+ * We're done with this primitive index scan. This might have
+ * been the final primitive index scan required, or the top-level
+ * index scan might require additional primitive scans.
*/
status = false;
}
@@ -696,9 +709,12 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
void
_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
{
+ BTScanOpaque so PG_USED_FOR_ASSERTS_ONLY = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
+ Assert(!so->needPrimScan);
+
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
@@ -732,12 +748,11 @@ _bt_parallel_done(IndexScanDesc scan)
parallel_scan->ps_offset);
/*
- * Mark the parallel scan as done for this combination of scan keys,
- * unless some other process already did so. See also
- * _bt_advance_array_keys.
+ * Mark the primitive index scan as done, unless some other process
+ * already did so. See also _bt_array_keys_remain.
*/
SpinLockAcquire(&btscan->btps_mutex);
- if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
+ if (so->numPrimScans >= btscan->btps_numPrimScans &&
btscan->btps_pageStatus != BTPARALLEL_DONE)
{
btscan->btps_pageStatus = BTPARALLEL_DONE;
@@ -751,14 +766,14 @@ _bt_parallel_done(IndexScanDesc scan)
}
/*
- * _bt_parallel_advance_array_keys() -- Advances the parallel scan for array
- * keys.
+ * _bt_parallel_next_primitive_scan() -- Advances parallel primitive scan
+ * counter when array keys are in use.
*
- * Updates the count of array keys processed for both local and parallel
+ * Updates the count of primitive index scans for both local and parallel
* scans.
*/
void
-_bt_parallel_advance_array_keys(IndexScanDesc scan)
+_bt_parallel_next_primitive_scan(IndexScanDesc scan)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
@@ -767,13 +782,13 @@ _bt_parallel_advance_array_keys(IndexScanDesc scan)
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
- so->arrayKeyCount++;
+ so->numPrimScans++;
SpinLockAcquire(&btscan->btps_mutex);
if (btscan->btps_pageStatus == BTPARALLEL_DONE)
{
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- btscan->btps_arrayKeyCount++;
+ btscan->btps_numPrimScans++;
}
SpinLockRelease(&btscan->btps_mutex);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 17ad89749..f15cd0870 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -893,7 +893,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
if (!so->qual_ok)
{
- /* Notify any other workers that we're done with this scan key. */
+ /* Notify any other workers that this primitive scan is done */
_bt_parallel_done(scan);
return false;
}
@@ -952,6 +952,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* one we use --- by definition, they are either redundant or
* contradictory.
*
+ * When SK_SEARCHARRAY keys are in use, _bt_tuple_before_array_keys is
+ * used to avoid prematurely stopping the scan when an array equality qual
+ * has its array keys advanced.
+ *
* Any regular (not SK_SEARCHNULL) key implies a NOT NULL qualifier.
* If the index stores nulls at the end of the index we'll be starting
* from, and we have no boundary key for the column (which means the key
@@ -1536,9 +1540,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
BTPageOpaque opaque;
OffsetNumber minoff;
OffsetNumber maxoff;
+ BTReadPageState pstate;
int itemIndex;
- bool continuescan;
- int indnatts;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1558,8 +1561,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
}
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ pstate.dir = dir;
+ pstate.highkey = NULL;
+ pstate.continuescan = true; /* default assumption */
+ pstate.highkeychecked = false;
+
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
@@ -1594,6 +1600,14 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (ScanDirectionIsForward(dir))
{
+ /* SK_SEARCHARRAY scans must provide high key up front */
+ if (so->numArrayKeys && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+
+ pstate.highkey = (IndexTuple) PageGetItem(page, iid);
+ }
+
/* load items[] in ascending order */
itemIndex = 0;
@@ -1616,7 +1630,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, iid);
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+ if (_bt_checkkeys(scan, &pstate, itup, false))
{
/* tuple passes all scan key conditions */
if (!BTreeTupleIsPosting(itup))
@@ -1649,7 +1663,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
}
}
/* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
+ if (!pstate.continuescan)
break;
offnum = OffsetNumberNext(offnum);
@@ -1666,17 +1680,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* only appear on non-pivot tuples on the right sibling page are
* common.
*/
- if (continuescan && !P_RIGHTMOST(opaque))
+ if (pstate.continuescan && !P_RIGHTMOST(opaque))
{
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
+ IndexTuple itup;
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+ if (pstate.highkey)
+ itup = pstate.highkey;
+ else
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+ }
+
+ _bt_checkkeys(scan, &pstate, itup, true);
}
- if (!continuescan)
+ if (!pstate.continuescan)
so->currPos.moreRight = false;
Assert(itemIndex <= MaxTIDsPerBTreePage);
@@ -1697,6 +1717,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
IndexTuple itup;
bool tuple_alive;
bool passes_quals;
+ bool finaltup = (offnum == minoff);
/*
* If the scan specifies not to return killed tuples, then we
@@ -1707,12 +1728,18 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* tuple on the page, we do check the index keys, to prevent
* uselessly advancing to the page to the left. This is similar
* to the high key optimization used by forward scans.
+ *
+ * Separately, _bt_checkkeys actually requires that we call it
+ * with the final non-pivot tuple from the page, if there's one
+ * (final processed tuple, or first tuple in offset number terms).
+ * We must indicate which particular tuple comes last, too.
*/
if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
{
Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
+ if (!finaltup)
{
+ Assert(offnum > minoff);
offnum = OffsetNumberPrev(offnum);
continue;
}
@@ -1724,8 +1751,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, iid);
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
+ passes_quals = _bt_checkkeys(scan, &pstate, itup, finaltup);
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions */
@@ -1764,7 +1790,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
}
}
}
- if (!continuescan)
+ if (!pstate.continuescan)
{
/* there can't be any more matches, so stop */
so->currPos.moreLeft = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index e4528db47..292d2e322 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -33,7 +33,7 @@
typedef struct BTSortArrayContext
{
- FmgrInfo flinfo;
+ FmgrInfo *orderproc;
Oid collation;
bool reverse;
} BTSortArrayContext;
@@ -41,15 +41,33 @@ typedef struct BTSortArrayContext
static Datum _bt_find_extreme_element(IndexScanDesc scan, ScanKey skey,
StrategyNumber strat,
Datum *elems, int nelems);
+static void _bt_sort_cmp_func_setup(IndexScanDesc scan, ScanKey skey);
static int _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
bool reverse,
Datum *elems, int nelems);
static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
+static inline int32 _bt_compare_array_skey(ScanKey cur, FmgrInfo *orderproc,
+ Datum datum, bool null,
+ Datum arrdatum);
+static int _bt_binsrch_array_skey(ScanDirection dir, bool cur_elem_start,
+ BTArrayKeyInfo *array, ScanKey cur,
+ FmgrInfo *orderproc, Datum datum, bool null,
+ int32 *final_result);
+static bool _bt_tuple_before_array_skeys(IndexScanDesc scan,
+ BTReadPageState *pstate,
+ IndexTuple tuple);
+static bool _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool skrequiredtrigger);
+static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir);
+static void _bt_advance_array_keys_to_end(IndexScanDesc scan, ScanDirection dir);
static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
ScanKey leftarg, ScanKey rightarg,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
+static bool _bt_check_compare(ScanDirection dir, ScanKey keyData, int keysz,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ bool *continuescan, bool *skrequiredtrigger);
static bool _bt_check_rowcompare(ScanKey skey,
IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
ScanDirection dir, bool *continuescan);
@@ -202,6 +220,21 @@ _bt_freestack(BTStack stack)
* array keys, it's sufficient to find the extreme element value and replace
* the whole array with that scalar value.
*
+ * In the worst case, the number of primitive index scans will equal the
+ * number of array elements (or the product of the number of array keys when
+ * there are multiple arrays/columns involved). It's also possible that the
+ * total number of primitive index scans will be far less than that.
+ *
+ * We always sort and deduplicate arrays up-front for equality array keys.
+ * ScalarArrayOpExpr execution need only visit leaf pages that might contain
+ * matches exactly once, while preserving the sort order of the index. This
+ * isn't just about performance; it also avoids needing duplicate elimination
+ * of matching TIDs (we prefer deduplicating search keys once, up-front).
+ * Equality SK_SEARCHARRAY keys are disjuncts that we always process in
+ * index/key space order, which makes this general approach feasible. Every
+ * index tuple will match no more than one single distinct combination of
+ * equality-constrained keys (array keys and other equality keys).
+ *
* Note: the reason we need so->arrayKeyData, rather than just scribbling
* on scan->keyData, is that callers are permitted to call btrescan without
* supplying a new set of scankey data.
@@ -212,6 +245,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
int numberOfKeys = scan->numberOfKeys;
int16 *indoption = scan->indexRelation->rd_indoption;
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(scan->indexRelation);
int numArrayKeys;
ScanKey cur;
int i;
@@ -265,6 +299,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
/* Allocate space for per-array data in the workspace context */
so->arrayKeys = (BTArrayKeyInfo *) palloc0(numArrayKeys * sizeof(BTArrayKeyInfo));
+ so->orderProcs = (FmgrInfo *) palloc(nkeyatts * sizeof(FmgrInfo));
/* Now process each array key */
numArrayKeys = 0;
@@ -281,6 +316,17 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
int j;
cur = &so->arrayKeyData[i];
+
+ /*
+ * Attributes with equality-type scan keys (including but not limited
+ * to array scan keys) will need a 3-way comparison function.
+ *
+ * XXX Clean this up some more. This repeats some of the same work
+ * when there are multiple scan keys for the same key column.
+ */
+ if (cur->sk_strategy == BTEqualStrategyNumber)
+ _bt_sort_cmp_func_setup(scan, cur);
+
if (!(cur->sk_flags & SK_SEARCHARRAY))
continue;
@@ -436,6 +482,42 @@ _bt_find_extreme_element(IndexScanDesc scan, ScanKey skey,
return result;
}
+/*
+ * Look up the appropriate comparison function in the opfamily.
+ *
+ * Note: it's possible that this would fail, if the opfamily is incomplete,
+ * but it seems quite unlikely that an opfamily would omit non-cross-type
+ * support functions for any datatype that it supports at all.
+ */
+static void
+_bt_sort_cmp_func_setup(IndexScanDesc scan, ScanKey skey)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ Oid elemtype;
+ RegProcedure cmp_proc;
+ FmgrInfo *orderproc = &so->orderProcs[skey->sk_attno - 1];
+
+ /*
+ * Determine the nominal datatype of the array elements. We have to
+ * support the convention that sk_subtype == InvalidOid means the opclass
+ * input type; this is a hack to simplify life for ScanKeyInit().
+ */
+ elemtype = skey->sk_subtype;
+ if (elemtype == InvalidOid)
+ elemtype = rel->rd_opcintype[skey->sk_attno - 1];
+
+ cmp_proc = get_opfamily_proc(rel->rd_opfamily[skey->sk_attno - 1],
+ rel->rd_opcintype[skey->sk_attno - 1],
+ elemtype,
+ BTORDER_PROC);
+ if (!RegProcedureIsValid(cmp_proc))
+ elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
+ BTORDER_PROC, elemtype, elemtype,
+ rel->rd_opfamily[skey->sk_attno - 1]);
+ fmgr_info_cxt(cmp_proc, orderproc, so->arrayContext);
+}
+
/*
* _bt_sort_array_elements() -- sort and de-dup array elements
*
@@ -450,42 +532,14 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
bool reverse,
Datum *elems, int nelems)
{
- Relation rel = scan->indexRelation;
- Oid elemtype;
- RegProcedure cmp_proc;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
BTSortArrayContext cxt;
if (nelems <= 1)
return nelems; /* no work to do */
- /*
- * Determine the nominal datatype of the array elements. We have to
- * support the convention that sk_subtype == InvalidOid means the opclass
- * input type; this is a hack to simplify life for ScanKeyInit().
- */
- elemtype = skey->sk_subtype;
- if (elemtype == InvalidOid)
- elemtype = rel->rd_opcintype[skey->sk_attno - 1];
-
- /*
- * Look up the appropriate comparison function in the opfamily.
- *
- * Note: it's possible that this would fail, if the opfamily is
- * incomplete, but it seems quite unlikely that an opfamily would omit
- * non-cross-type support functions for any datatype that it supports at
- * all.
- */
- cmp_proc = get_opfamily_proc(rel->rd_opfamily[skey->sk_attno - 1],
- elemtype,
- elemtype,
- BTORDER_PROC);
- if (!RegProcedureIsValid(cmp_proc))
- elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
- BTORDER_PROC, elemtype, elemtype,
- rel->rd_opfamily[skey->sk_attno - 1]);
-
/* Sort the array elements */
- fmgr_info(cmp_proc, &cxt.flinfo);
+ cxt.orderproc = &so->orderProcs[skey->sk_attno - 1];
cxt.collation = skey->sk_collation;
cxt.reverse = reverse;
qsort_arg(elems, nelems, sizeof(Datum),
@@ -507,7 +561,7 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
BTSortArrayContext *cxt = (BTSortArrayContext *) arg;
int32 compare;
- compare = DatumGetInt32(FunctionCall2Coll(&cxt->flinfo,
+ compare = DatumGetInt32(FunctionCall2Coll(cxt->orderproc,
cxt->collation,
da, db));
if (cxt->reverse)
@@ -515,6 +569,171 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
return compare;
}
+/*
+ * Comparator uses to search for the next array element when array keys need
+ * to be advanced via one or more binary searches
+ *
+ * This code is loosely based on _bt_compare. However, there are some
+ * important differences.
+ *
+ * It is convenient to think of calling _bt_compare as comparing caller's
+ * insertion scankey to an index tuple. But our callers are not searching
+ * through the index at all -- they're searching through a local array of
+ * datums associated with a scan key (using values they've taken from an index
+ * tuple). This is a complete reversal of how things usually work, which can
+ * be confusing.
+ *
+ * Callers of this function should think of it as comparing "datum" (as well
+ * as "null") to "arrdatum". This is the same approach that _bt_compare takes
+ * in that both functions compare the value that they're searching for to one
+ * particular item used as a binary search pivot. (But it's the wrong way
+ * around if you think of it as "tuple values vs scan key values". So don't.)
+*/
+static inline int32
+_bt_compare_array_skey(ScanKey cur,
+ FmgrInfo *orderproc,
+ Datum datum,
+ bool null,
+ Datum arrdatum)
+{
+ int32 result = 0;
+
+ Assert(cur->sk_strategy == BTEqualStrategyNumber);
+
+ if (cur->sk_flags & SK_ISNULL) /* array/scan key is NULL */
+ {
+ if (null)
+ result = 0; /* NULL "=" NULL */
+ else if (cur->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NULL "<" NOT_NULL */
+ else
+ result = -1; /* NULL ">" NOT_NULL */
+ }
+ else if (null) /* array/scan key is NOT_NULL and tuple item
+ * is NULL */
+ {
+ if (cur->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NOT_NULL ">" NULL */
+ else
+ result = 1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * Like _bt_compare, we need to be careful of cross-type comparisons,
+ * so the left value has to be the value that came from an index
+ * tuple. (Array scan keys cannot be cross-type, but other required
+ * scan keys that use an equal operator can be.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(orderproc, cur->sk_collation,
+ datum, arrdatum));
+
+ /*
+ * Unlike _bt_compare, we flip the sign when column is a DESC column
+ * (and *not* when column is ASC). This matches the approach taken by
+ * _bt_check_rowcompare, which performs similar three-way comparisons.
+ */
+ if (cur->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ return result;
+}
+
+/*
+ * _bt_binsrch_array_skey() -- Binary search for next matching array key
+ *
+ * cur_elem_start indicates if the binary search should begin at the array's
+ * current element (or have the current element as an upper bound if it's a
+ * backward scan). This allows searches against required scan key arrays to
+ * reuse the work of earlier searches, at least in many important cases.
+ * Array keys covering key space that the index scan already processed cannot
+ * possibly contain any matches.
+ *
+ * XXX There are several fairly obvious optimizations that we could apply here
+ * (e.g., precheck searches for earlier subsets of a larger array would help).
+ * Revisit this during the next round of performance validation.
+ *
+ * Returns an index to the first array element >= caller's datum argument.
+ * Also sets *final_result to whatever _bt_compare_array_skey returned when we
+ * directly compared the returned array element to searched-for datum.
+ */
+static int
+_bt_binsrch_array_skey(ScanDirection dir, bool cur_elem_start,
+ BTArrayKeyInfo *array, ScanKey cur,
+ FmgrInfo *orderproc, Datum datum, bool null,
+ int32 *final_result)
+{
+ int low_elem,
+ high_elem,
+ first_elem_dir,
+ result = 0;
+ bool knownequal = false;
+
+ Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(cur->sk_strategy == BTEqualStrategyNumber);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ first_elem_dir = 0;
+ low_elem = array->cur_elem;
+ high_elem = array->num_elems - 1;
+ if (cur_elem_start)
+ low_elem = 0;
+ }
+ else
+ {
+ first_elem_dir = array->num_elems - 1;
+ low_elem = 0;
+ high_elem = array->cur_elem;
+ if (cur_elem_start)
+ {
+ low_elem = 0;
+ high_elem = first_elem_dir;
+ }
+ }
+
+ while (high_elem > low_elem)
+ {
+ int mid_elem = low_elem + ((high_elem - low_elem) / 2);
+ Datum arrdatum = array->elem_values[mid_elem];
+
+ result = _bt_compare_array_skey(cur, orderproc, datum, null, arrdatum);
+
+ if (result == 0)
+ {
+ /*
+ * Each array was deduplicated during initial preprocessing, so
+ * there each element is guaranteed to be unique. We can quit as
+ * soon as we see an equal array, saving ourselves an extra
+ * comparison or two...
+ */
+ low_elem = mid_elem;
+ knownequal = true;
+ break;
+ }
+
+ if (result > 0)
+ low_elem = mid_elem + 1;
+ else
+ high_elem = mid_elem;
+ }
+
+ /*
+ * ... but our caller also cares about the position of the searched-for
+ * datum relative to the low_elem match we'll return. Make sure that we
+ * set *final_result to the result that comes from comparing low_elem's
+ * key value to the datum that caller had us search for.
+ */
+ if (!knownequal)
+ result = _bt_compare_array_skey(cur, orderproc, datum, null,
+ array->elem_values[low_elem]);
+
+ *final_result = result;
+
+ return low_elem;
+}
+
/*
* _bt_start_array_keys() -- Initialize array keys at start of a scan
*
@@ -539,82 +758,22 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
curArrayKey->cur_elem = 0;
skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
}
-
- so->arraysStarted = true;
-}
-
-/*
- * _bt_advance_array_keys() -- Advance to next set of array elements
- *
- * Returns true if there is another set of values to consider, false if not.
- * On true result, the scankeys are initialized with the next set of values.
- */
-bool
-_bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- bool found = false;
- int i;
-
- /*
- * We must advance the last array key most quickly, since it will
- * correspond to the lowest-order index column among the available
- * qualifications. This is necessary to ensure correct ordering of output
- * when there are multiple array keys.
- */
- for (i = so->numArrayKeys - 1; i >= 0; i--)
- {
- BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
- ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key];
- int cur_elem = curArrayKey->cur_elem;
- int num_elems = curArrayKey->num_elems;
-
- if (ScanDirectionIsBackward(dir))
- {
- if (--cur_elem < 0)
- {
- cur_elem = num_elems - 1;
- found = false; /* need to advance next array key */
- }
- else
- found = true;
- }
- else
- {
- if (++cur_elem >= num_elems)
- {
- cur_elem = 0;
- found = false; /* need to advance next array key */
- }
- else
- found = true;
- }
-
- curArrayKey->cur_elem = cur_elem;
- skey->sk_argument = curArrayKey->elem_values[cur_elem];
- if (found)
- break;
- }
-
- /* advance parallel scan */
- if (scan->parallel_scan != NULL)
- _bt_parallel_advance_array_keys(scan);
-
- /*
- * When no new array keys were found, the scan is "past the end" of the
- * array keys. _bt_start_array_keys can still "restart" the array keys if
- * a rescan is required.
- */
- if (!found)
- so->arraysStarted = false;
-
- return found;
}
/*
* _bt_mark_array_keys() -- Handle array keys during btmarkpos
*
* Save the current state of the array keys as the "mark" position.
+ *
+ * XXX The current set of array keys are not independent of the current scan
+ * position, so why treat them that way?
+ *
+ * We shouldn't even bother remembering the current array keys when btmarkpos
+ * is called. The array keys should be handled lazily instead. If and when
+ * btrestrpos is called, it can just set every array's cur_elem to the first
+ * element for the current scan direction. When _bt_advance_array_keys is
+ * reached (during the first call to _bt_checkkeys that follows), it will
+ * automatically search for the relevant array keys using caller's tuple.
*/
void
_bt_mark_array_keys(IndexScanDesc scan)
@@ -661,13 +820,8 @@ _bt_restore_array_keys(IndexScanDesc scan)
* If we changed any keys, we must redo _bt_preprocess_keys. That might
* sound like overkill, but in cases with multiple keys per index column
* it seems necessary to do the full set of pushups.
- *
- * Also do this whenever the scan's set of array keys "wrapped around" at
- * the end of the last primitive index scan. There won't have been a call
- * to _bt_preprocess_keys from some other place following wrap around, so
- * we do it for ourselves.
*/
- if (changed || !so->arraysStarted)
+ if (changed)
{
_bt_preprocess_keys(scan);
/* The mark should have been set on a consistent set of keys... */
@@ -675,6 +829,785 @@ _bt_restore_array_keys(IndexScanDesc scan)
}
}
+/*
+ * Routine to determine if a continuescan=false tuple (set that way by an
+ * initial call to _bt_check_compare) might need to advance the scan's array
+ * keys.
+ *
+ * Returns true when caller passes a tuple that is < the current set of array
+ * keys for the most significant non-equal column/scan key (or > for backwards
+ * scans). This means that it cannot possibly be time to advance the array
+ * keys just yet. _bt_checkkeys caller should suppress its _bt_check_compare
+ * call, and return -- the tuple is treated as not satisfy our indexquals.
+ *
+ * Returns false when caller's tuple is >= the current array keys (or <=, in
+ * the case of backwards scans). This means that it might be time for our
+ * caller to advance the array keys to the next set.
+ *
+ * Note: advancing the array keys may be required when every attribute value
+ * from caller's tuple is equal to corresponding scan key/array datums. See
+ * comments at the start of _bt_advance_array_keys for more.
+ */
+static bool
+_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ ScanDirection dir = pstate->dir;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ bool tuple_before_array_keys = false;
+ ScanKey cur;
+ int ntupatts = BTreeTupleGetNAtts(tuple, rel),
+ ikey;
+
+ Assert(so->qual_ok);
+ Assert(so->numArrayKeys > 0);
+ Assert(so->numberOfKeys > 0);
+ Assert(!so->needPrimScan);
+
+ for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++)
+ {
+ int attnum = cur->sk_attno;
+ FmgrInfo *orderproc;
+ Datum datum;
+ bool null,
+ skrequired;
+ int32 result;
+
+ /*
+ * We only deal with equality strategy scan keys. We leave handling
+ * of inequalities up to _bt_check_compare.
+ */
+ if (cur->sk_strategy != BTEqualStrategyNumber)
+ continue;
+
+ /*
+ * Determine if this scan key is required in the current scan
+ * direction
+ */
+ skrequired = ((ScanDirectionIsForward(dir) &&
+ (cur->sk_flags & SK_BT_REQFWD)) ||
+ (ScanDirectionIsBackward(dir) &&
+ (cur->sk_flags & SK_BT_REQBKWD)));
+
+ /*
+ * Unlike _bt_advance_array_keys, we never deal with any non-required
+ * array keys. Cases where skrequiredtrigger is set to false by
+ * _bt_check_compare should never call here. We are only called after
+ * _bt_check_compare provisionally indicated that the scan should be
+ * terminated due to a _required_ scan key not being satisfied.
+ *
+ * We expect _bt_check_compare to notice and report required scan keys
+ * before non-required ones. _bt_advance_array_keys might still have
+ * to advance non-required array keys in passing for a tuple that we
+ * were called for, but _bt_advance_array_keys doesn't rely on us to
+ * give it advanced notice of that.
+ */
+ if (!skrequired)
+ break;
+
+ if (attnum > ntupatts)
+ {
+ /*
+ * When we reach a high key's truncated attribute, assume that the
+ * tuple attribute's value is >= the scan's search-type scan keys
+ */
+ break;
+ }
+
+ datum = index_getattr(tuple, attnum, itupdesc, &null);
+
+ orderproc = &so->orderProcs[attnum - 1];
+ result = _bt_compare_array_skey(cur, orderproc,
+ datum, null,
+ cur->sk_argument);
+
+ if (result != 0)
+ {
+ if (ScanDirectionIsForward(dir))
+ tuple_before_array_keys = result < 0;
+ else
+ tuple_before_array_keys = result > 0;
+
+ break;
+ }
+ }
+
+ return tuple_before_array_keys;
+}
+
+/*
+ * _bt_array_keys_remain() -- Start another primitive index scan?
+ *
+ * Returns true if _bt_checkkeys determined that another primitive index scan
+ * must take place by calling _bt_first. Otherwise returns false, indicating
+ * that caller's top-level scan is now past the point where further matching
+ * index tuples can be found (for the current scan direction).
+ *
+ * Only call here during scans with one or more equality type array scan keys.
+ * All other scans should just call _bt_first once, no matter what.
+ *
+ * Top-level index scans executed via multiple primitive index scans must not
+ * fail to output index tuples in the usual order for the index -- just like
+ * any other index scan would. The state machine that manages the scan's
+ * array keys must only start primitive index scans when they cover key space
+ * strictly greater than the key space for tuples that the scan has already
+ * returned (or strictly less in the backwards scan case). Otherwise the scan
+ * could output the same index tuples more than once, or in the wrong order.
+ *
+ * This is managed by limiting the cases that can trigger new primitive index
+ * scans to those involving required array scan keys and/or other required
+ * scan keys that use the equality strategy. In particular, the state machine
+ * must not allow high order required scan keys using an inequality strategy
+ * (which are only required in one scan direction) to directly trigger a new
+ * primitive index scan that advances low order non-required array scan keys.
+ * For example, a query such as "SELECT thousand, tenthous FROM tenk1 WHERE
+ * thousand < 2 AND tenthous IN (1001,3000) ORDER BY thousand" whose execution
+ * involves a scan of an index on "(thousand, tenthous)" must perform no more
+ * than a single primitive index scan. Otherwise we risk outputting tuples in
+ * the wrong order. Array key values for the non-required scan key on the
+ * "tenthous" column must not dictate top-level scan order. Primitive index
+ * scans mustn't scan tuples already scanned by some earlier primitive scan.
+ *
+ * In fact, nbtree makes a stronger guarantee than is strictly necessary here:
+ * it guarantees that the top-level scan won't repeat any leaf page reads.
+ * (Actually, that can still happen when the scan is repositioned, or the scan
+ * direction changes -- but that's just as true with other types of scans.)
+ */
+bool
+_bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ Assert(so->numArrayKeys);
+
+ /*
+ * Array keys are advanced within _bt_checkkeys when the scan reaches the
+ * leaf level (more precisely, they're advanced when the scan reaches the
+ * end of each distinct set of array elements). This process avoids
+ * repeat access to leaf pages (across multiple primitive index scans) by
+ * opportunistically advancing the scan's array keys when it allows the
+ * primitive index scan to find nearby matching tuples (or to eliminate
+ * array keys with no matching tuples from further consideration).
+ *
+ * _bt_checkkeys sets a simple flag variable that we check here. This
+ * tells us if we need to perform another primitive index scan for the
+ * now-current array keys or not. We'll unset the flag once again to
+ * acknowledge having started a new primitive scan (or we'll see that it
+ * isn't set and end the top-level scan right away).
+ *
+ * We cannot rely on _bt_first always reaching _bt_checkkeys here. There
+ * are various scenarios where that won't happen. For example, if the
+ * index is completely empty, then _bt_first won't get as far as calling
+ * _bt_readpage/_bt_checkkeys.
+ *
+ * We also don't expect _bt_checkkeys to be reached when searching for a
+ * non-existent value that happens to be higher than any existing value in
+ * the index. No _bt_checkkeys are expected when _bt_readpage reads the
+ * rightmost page during such a scan -- even a _bt_checkkeys call against
+ * the high key won't happen. There is an analogous issue for backwards
+ * scans that search for a value lower than all existing index tuples.
+ *
+ * We don't actually require special handling for these cases -- we don't
+ * need to be explicitly instructed to _not_ perform another primitive
+ * index scan. This is correct for all of the cases we've listed so far,
+ * which all involve primitive index scans that access pages "near the
+ * boundaries of the key space" (the leftmost page, the rightmost page, or
+ * an imaginary empty leaf root page). If _bt_checkkeys cannot be reached
+ * by a primitive index scan for one set of array keys, it follows that it
+ * also won't be reached for any later set of array keys.
+ *
+ * There is one exception: the case where _bt_first's _bt_preprocess_keys
+ * call determined that the scan's input scan keys can never be satisfied.
+ * That might be true for one set of array keys, but not the next set.
+ */
+ if (!so->qual_ok)
+ {
+ /*
+ * Qual can never be satisfied. Advance our array keys incrementally.
+ */
+ so->needPrimScan = false;
+ if (_bt_advance_array_keys_increment(scan, dir))
+ return true;
+ }
+
+ /* Time for another primitive index scan? */
+ if (so->needPrimScan)
+ {
+ /* Begin primitive index scan */
+ so->needPrimScan = false;
+
+ if (scan->parallel_scan != NULL)
+ _bt_parallel_next_primitive_scan(scan);
+
+ return true;
+ }
+
+ /*
+ * No more primitive index scans. Just terminate the top-level scan.
+ */
+ _bt_advance_array_keys_to_end(scan, dir);
+
+ if (scan->parallel_scan != NULL)
+ _bt_parallel_done(scan);
+
+ return false;
+}
+
+/*
+ * _bt_advance_array_keys() -- Advance array elements using a tuple
+ *
+ * Returns true if all required equality-type scan keys (in particular, those
+ * that are array keys) now have exact matching values to those from tuple.
+ * Returns false when the tuple isn't an exact match in this sense.
+ *
+ * Sets pstate.continuescan for caller when we return false. When we return
+ * true it's up to caller to call _bt_check_compare to recheck the tuple. It
+ * is okay to let the second call set pstate.continuescan=false without
+ * further intervention, since we know that it can only be for a scan key that
+ * is required in one direction.
+ *
+ * When called with skrequiredtrigger, we don't expect to have to advance any
+ * non-required scan keys. We'll always set pstate.continuescan because a
+ * non-required scan key can never terminate the scan.
+ *
+ * Required array keys are always advanced to the highest element >= the
+ * corresponding tuple attribute values for its most significant non-equal
+ * column (or the next lowest set <= the tuple value during backwards scans).
+ * If we reach the end of the array keys for the current scan direction, we
+ * end the top-level index scan.
+ *
+ * _bt_tuple_before_array_skeys is responsible for determining if the current
+ * place in the scan is >= the current array keys (or <= during backward
+ * scans). This must be established first, before calling here.
+ *
+ * Note that we may sometimes need to advance the array keys in spite of the
+ * existing array keys already being an exact match for every corresponding
+ * value from caller's tuple. We fall back on "incrementally" advancing the
+ * array keys in these cases, which involve inequality strategy scan keys.
+ * For example, with a composite index on (a, b) and a qual "WHERE a IN (3,5)
+ * AND b < 42", we'll be called for both "a" arry keys (keys 3 and 5) when the
+ * scan reaches tuples where "b >= 42". Even though "a" array keys continue
+ * to have exact matches for tuples "b >= 42" (for both array key groupings),
+ * we will still advance the array for "a" via our fallback on incremental
+ * advancement each time we're called. The first time we're called (when the
+ * scan reaches a tuple >= "(3, 42)"), we advance the array key (from 3 to 5).
+ * This gives our caller the option of starting a new primitive index scan
+ * that quickly locates the start of tuples > "(5, -inf)". The second time
+ * we're called (when the scan reaches a tuple >= "(5, 42)"), we incrementally
+ * advance the keys a second time. This second call ends the top-level scan.
+ *
+ * Note also that we deal with all required equality-type scan keys here; it's
+ * not limited to array scan keys. We need to handle non-array equality cases
+ * here because they're equality constraints for the scan, in the same way
+ * that array scan keys are. We must not suppress cases where a call to
+ * _bt_check_compare sets continuescan=false for a required scan key that uses
+ * the equality strategy (only inequality-type scan keys get that treatment).
+ * We don't want to suppress the scan's termination when it's inappropriate.
+ */
+static bool
+_bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool skrequiredtrigger)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ ScanDirection dir = pstate->dir;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ ScanKey cur;
+ int ikey,
+ arrayidx = 0,
+ ntupatts = BTreeTupleGetNAtts(tuple, rel);
+ bool arrays_advanced = false,
+ arrays_done = false,
+ all_skrequired_atts_wrapped = skrequiredtrigger,
+ all_atts_equal = true;
+
+ Assert(so->numberOfKeys > 0);
+ Assert(so->numArrayKeys > 0);
+ Assert(so->qual_ok);
+
+ /*
+ * Try to advance array keys via a series of binary searches.
+ *
+ * Loop iterates through the current scankeys (so->keyData, which were
+ * output by _bt_preprocess_keys earlier) and then sets input scan keys
+ * (so->arrayKeyData scan keys) to new array values. This sets things up
+ * for our call to _bt_preprocess_keys, which is where the current scan
+ * keys actually change.
+ *
+ * We need to do things this way because only current/preprocessed scan
+ * keys will be marked as required. It's also possible that the previous
+ * call to _bt_preprocess_keys eliminated one or more input scan keys
+ * (possibly array type scan keys) that were deemed to be redundant.
+ */
+ for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++)
+ {
+ BTArrayKeyInfo *array = NULL;
+ ScanKey skeyarray = NULL;
+ FmgrInfo *orderproc;
+ int attnum = cur->sk_attno,
+ first_elem_dir,
+ final_elem_dir,
+ set_elem;
+ Datum datum;
+ bool skrequired,
+ null;
+ int32 result;
+
+ /*
+ * We only deal with equality strategy scan keys. We leave handling
+ * of inequalities up to _bt_check_compare.
+ */
+ if (cur->sk_strategy != BTEqualStrategyNumber)
+ continue;
+
+ /*
+ * Determine if this scan key is required in the current scan
+ * direction
+ */
+ skrequired = ((ScanDirectionIsForward(dir) &&
+ (cur->sk_flags & SK_BT_REQFWD)) ||
+ (ScanDirectionIsBackward(dir) &&
+ (cur->sk_flags & SK_BT_REQBKWD)));
+
+ /*
+ * Optimization: we don't have to advance remaining non-required array
+ * keys when we already know that tuple won't be returned by the scan.
+ *
+ * Deliberately check this both here and after the binary search.
+ */
+ if (!skrequired && !all_atts_equal)
+ break;
+
+ /*
+ * We need to check required non-array scan keys (that use the equal
+ * strategy), as well as required and non-required array scan keys
+ * (also limited to those that use the equal strategy, since array
+ * inequalities degenerate into a simple comparison).
+ *
+ * Perform initial set up for this scan key. If it is backed by an
+ * array then we need to set variables describing the current position
+ * in the array.
+ */
+ orderproc = &so->orderProcs[attnum - 1];
+ first_elem_dir = final_elem_dir = 0; /* keep compiler quiet */
+ if (cur->sk_flags & SK_SEARCHARRAY)
+ {
+ /* Set up array comparison function */
+ Assert(arrayidx < so->numArrayKeys);
+ array = &so->arrayKeys[arrayidx++];
+ skeyarray = &so->arrayKeyData[array->scan_key];
+
+ /*
+ * It's possible that _bt_preprocess_keys determined that an
+ * individual array scan key wasn't required in so->keyData for
+ * the ongoing primitive index scan due to it being redundant or
+ * contradictory (the current array value might be redundant next
+ * to some other scan key on the same attribute). Deal with that.
+ */
+ if (unlikely(skeyarray->sk_attno != attnum))
+ {
+ bool found PG_USED_FOR_ASSERTS_ONLY = false;
+
+ for (; arrayidx < so->numArrayKeys; arrayidx++)
+ {
+ array = &so->arrayKeys[arrayidx];
+ skeyarray = &so->arrayKeyData[array->scan_key];
+ if (skeyarray->sk_attno == attnum)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ Assert(found);
+ }
+
+ /* Proactively set up state used to handle array wraparound */
+ if (ScanDirectionIsForward(dir))
+ {
+ first_elem_dir = 0;
+ final_elem_dir = array->num_elems - 1;
+ }
+ else
+ {
+ first_elem_dir = array->num_elems - 1;
+ final_elem_dir = 0;
+ }
+ }
+ else if (attnum > ntupatts)
+ {
+ /*
+ * Nothing needs to be done when we have a truncated attribute
+ * (possible when caller's tuple is a page high key) and a
+ * non-array scan key
+ */
+ Assert(ScanDirectionIsForward(dir));
+ continue;
+ }
+
+ /*
+ * Here we perform steps for any required scan keys after the first
+ * non-equal required scan key. The first scan key must have been set
+ * to a value > the value from the tuple back when we dealt with it
+ * (or, for a backwards scan, to a value < the value from the tuple).
+ * That needs to "cascade" to lower-order array scan keys. They must
+ * be set to the first array element for the current scan direction.
+ *
+ * We're still setting the keys to values >= the tuple here -- it just
+ * needs to work for the tuple as a whole. For example, when a tuple
+ * "(a, b) = (42, 5)" advances the array keys on "a" from 40 to 45, we
+ * must also set "b" to whatever the first array element for "b" is.
+ * It would be wrong to allow "b" to be set to a value from the tuple,
+ * since the value is actually from a different part of the key space.
+ *
+ * Also defensively do this with truncated attributes when caller's
+ * tuple is a page high key.
+ */
+ if (array && ((arrays_advanced && !all_atts_equal) ||
+ attnum > ntupatts))
+ {
+ /* Shouldn't reach this far for a non-required scan key */
+ Assert(skrequired && skrequiredtrigger && attnum > 1);
+
+ /*
+ * We set the array to the first element (if needed) here, and we
+ * don't unset all_required_atts_wrapped. This array therefore
+ * counts as a wrapped array when we go on to determine if all of
+ * the required arrays have wrapped (after this loop).
+ */
+ if (array->cur_elem != first_elem_dir)
+ {
+ array->cur_elem = first_elem_dir;
+ skeyarray->sk_argument = array->elem_values[first_elem_dir];
+ arrays_advanced = true;
+ }
+
+ continue;
+ }
+
+ /*
+ * Going to compare scan key to corresponding tuple attribute value
+ */
+ datum = index_getattr(tuple, attnum, itupdesc, &null);
+
+ if (!array)
+ {
+ if (!skrequired || !all_atts_equal)
+ continue;
+
+ /*
+ * This is a required non-array scan key that uses the equal
+ * strategy. See header comments for an explanation of why we
+ * need to do this.
+ */
+ result = _bt_compare_array_skey(cur, orderproc, datum, null,
+ cur->sk_argument);
+
+ if (result != 0)
+ {
+ /*
+ * tuple attribute value is > scan key value (or < scan key
+ * value in the backward scan case).
+ */
+ all_atts_equal = false;
+ break;
+ }
+
+ continue;
+ }
+
+ /*
+ * Binary search for an array key >= the tuple value, which we'll then
+ * set as our current array key (or <= the tuple value if this is a
+ * backward scan).
+ *
+ * The binary search excludes array keys that we've already processed
+ * from consideration, except with a non-required scan key's array.
+ * This is not just an optimization -- it's important for correctness.
+ * It is crucial that required array scan keys only have their array
+ * keys advanced in the current scan direction. We need to advance
+ * required array keys in lock step with the index scan.
+ *
+ * Note in particular that arrays_advanced must only be set when the
+ * array is advanced to a key >= the existing key, or <= for a
+ * backwards scan. (Though see notes about wraparound below.)
+ */
+ set_elem = _bt_binsrch_array_skey(dir, (!skrequired || arrays_advanced),
+ array, cur, orderproc, datum, null,
+ &result);
+
+ /*
+ * Maintain the state that tracks whether all attribute from the tuple
+ * are equal to the array keys that we've set as current (or existing
+ * array keys set during earlier calls here).
+ */
+ if (result != 0)
+ all_atts_equal = false;
+
+ /*
+ * Optimization: we don't have to advance remaining non-required array
+ * keys when we already know that tuple won't be returned by the scan.
+ * Quit before setting the array keys to avoid _bt_preprocess_keys.
+ *
+ * Deliberately check this both before and after the binary search.
+ */
+ if (!skrequired && !all_atts_equal)
+ break;
+
+ /*
+ * If the binary search indicates that the key space for this tuple
+ * attribute value is > the key value from the final element in the
+ * array (final for the current scan direction), we handle it by
+ * wrapping around to the first element of the array.
+ *
+ * Wrapping around simplifies advancement with a multi-column index by
+ * allowing us to treat wrapping a column as advancing the column. We
+ * preserve the invariant that a required scan key's array may only be
+ * ratcheted forward (backwards when the scan direction is backwards),
+ * while still always being able to "advance" the array at this point.
+ */
+ if (set_elem == final_elem_dir &&
+ ((ScanDirectionIsForward(dir) && result > 0) ||
+ (ScanDirectionIsBackward(dir) && result < 0)))
+ {
+ /* Perform wraparound */
+ set_elem = first_elem_dir;
+ }
+ else if (skrequired)
+ {
+ /* Won't call _bt_advance_array_keys_to_end later */
+ all_skrequired_atts_wrapped = false;
+ }
+
+ Assert(set_elem >= 0 && set_elem < array->num_elems);
+ if (array->cur_elem != set_elem)
+ {
+ array->cur_elem = set_elem;
+ skeyarray->sk_argument = array->elem_values[set_elem];
+ arrays_advanced = true;
+
+ /*
+ * We shouldn't have to advance a required array when called due
+ * to _bt_check_compare determining that a non-required array
+ * needs to be advanced. We expect _bt_check_compare to notice
+ * and report required scan keys before non-required ones.
+ */
+ Assert(skrequiredtrigger || !skrequired);
+ }
+ }
+
+ if (!skrequiredtrigger)
+ {
+ /*
+ * Failing to satisfy a non-required array scan key shouldn't ever
+ * result in terminating the (primitive) index scan
+ */
+ }
+ else if (all_skrequired_atts_wrapped)
+ {
+ /*
+ * The binary searches for each tuple's attribute value in the scan
+ * key's corresponding SK_SEARCHARRAY array all found that the tuple's
+ * value are "past the end" of the key space covered by each array
+ */
+ _bt_advance_array_keys_to_end(scan, dir);
+ arrays_done = true;
+ all_atts_equal = false; /* at least not now */
+ }
+ else if (!arrays_advanced)
+ {
+ /*
+ * We must always advance the array keys by at least one increment
+ * (except when called to advance a non-required scan key's array).
+ *
+ * We need this fallback for cases where the existing array keys and
+ * existing required equal-strategy scan keys were fully equal to the
+ * tuple. _bt_check_compare may have set continuescan=false due to an
+ * inequality terminating the scan, which we don't deal with directly.
+ * (See function's header comments for an example.)
+ */
+ if (_bt_advance_array_keys_increment(scan, dir))
+ arrays_advanced = true;
+ else
+ arrays_done = true;
+ all_atts_equal = false; /* at least not now */
+ }
+
+ /*
+ * Might make sense to recheck the high key later on in cases where we
+ * just advanced the keys (unless we were just called to advance the
+ * scan's non-required array keys)
+ */
+ if (arrays_advanced && skrequiredtrigger)
+ pstate->highkeychecked = false;
+
+ /*
+ * If we changed the array keys without exhausting all array keys then we
+ * need to preprocess our search-type scan keys once more
+ */
+ Assert(skrequiredtrigger || !arrays_done);
+ if (arrays_advanced && !arrays_done)
+ {
+ /*
+ * XXX Think about buffer-lock-held hazards here some more.
+ *
+ * In almost all interesting cases we only really need to copy over
+ * the array values (from "so->arrayKeyData" to "so->keyData"). But
+ * there are at least some cases where performing the full set of push
+ * ups here (or close to it) might add value over just doing it for
+ * the main _bt_first call.
+ */
+ _bt_preprocess_keys(scan);
+ }
+
+ /* Are we now done with the top-level scan (barring a btrescan)? */
+ Assert(!so->needPrimScan);
+ if (!so->qual_ok)
+ {
+ /*
+ * Increment array keys and start a new primitive index scan if
+ * _bt_preprocess_keys() discovered that the scan keys can never be
+ * satisfied (eg, x == 2 AND x in (1, 2, 3) for array keys 1 and 2).
+ *
+ * Note: There is similar handling in _bt_array_keys_remain, which
+ * must advance the array keys without consulting us in this one case.
+ */
+ Assert(skrequiredtrigger);
+
+ pstate->continuescan = false;
+ pstate->highkeychecked = true;
+ all_atts_equal = false; /* at least not now */
+
+ if (_bt_advance_array_keys_increment(scan, dir))
+ so->needPrimScan = true;
+ }
+ else if (!skrequiredtrigger)
+ {
+ /* Not when we failed to satisfy a non-required scan key, ever */
+ Assert(!arrays_done);
+ pstate->continuescan = true;
+ }
+ else if (arrays_done)
+ {
+ /*
+ * Yep -- this primitive scan was our last
+ */
+ Assert(!all_atts_equal);
+ pstate->continuescan = false;
+ }
+ else if (!all_atts_equal)
+ {
+ /*
+ * Not done. The top-level index scan (and primitive index scan) will
+ * continue, since the array keys advanced.
+ */
+ Assert(arrays_advanced);
+ pstate->continuescan = true;
+
+ /*
+ * Some required array keys might have wrapped around during this
+ * call, but it can't have been the most significant array scan key.
+ */
+ Assert(!all_skrequired_atts_wrapped);
+ }
+ else
+ {
+ /*
+ * Not done. A second call to _bt_check_compare must now take place.
+ * It will make the final decision on setting continuescan.
+ */
+ }
+
+ return all_atts_equal;
+}
+
+/*
+ * Advance the array keys by a single increment in the current scan direction
+ */
+static bool
+_bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool found = false;
+ int i;
+
+ Assert(!so->needPrimScan);
+
+ /*
+ * We must advance the last array key most quickly, since it will
+ * correspond to the lowest-order index column among the available
+ * qualifications. This is necessary to ensure correct ordering of output
+ * when there are multiple array keys.
+ */
+ for (i = so->numArrayKeys - 1; i >= 0; i--)
+ {
+ BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
+ ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key];
+ int cur_elem = curArrayKey->cur_elem;
+ int num_elems = curArrayKey->num_elems;
+
+ if (ScanDirectionIsBackward(dir))
+ {
+ if (--cur_elem < 0)
+ {
+ cur_elem = num_elems - 1;
+ found = false; /* need to advance next array key */
+ }
+ else
+ found = true;
+ }
+ else
+ {
+ if (++cur_elem >= num_elems)
+ {
+ cur_elem = 0;
+ found = false; /* need to advance next array key */
+ }
+ else
+ found = true;
+ }
+
+ curArrayKey->cur_elem = cur_elem;
+ skey->sk_argument = curArrayKey->elem_values[cur_elem];
+ if (found)
+ break;
+ }
+
+ return found;
+}
+
+/*
+ * Perform final steps when the "end point" is reached on the leaf level
+ * without any call to _bt_checkkeys setting *continuescan to false.
+ */
+static void
+_bt_advance_array_keys_to_end(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ Assert(so->numArrayKeys);
+ Assert(!so->needPrimScan);
+
+ for (int i = 0; i < so->numArrayKeys; i++)
+ {
+ BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
+ ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key];
+ int reset_elem;
+
+ if (ScanDirectionIsForward(dir))
+ reset_elem = curArrayKey->num_elems - 1;
+ else
+ reset_elem = 0;
+
+ if (curArrayKey->cur_elem != reset_elem)
+ {
+ curArrayKey->cur_elem = reset_elem;
+ skey->sk_argument = curArrayKey->elem_values[reset_elem];
+ }
+ }
+}
/*
* _bt_preprocess_keys() -- Preprocess scan keys
@@ -1360,38 +2293,204 @@ _bt_mark_scankey_required(ScanKey skey)
*
* Return true if so, false if not. If the tuple fails to pass the qual,
* we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
+ * this tuple, and set pstate.continuescan accordingly. See comments for
* _bt_preprocess_keys(), above, about how this is done.
*
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
+ * Forward scan callers can pass a high key tuple in the hopes of having us
+ * set pstate.continuescan to false, and avoiding an unnecessary visit to the
+ * page to the right.
+ *
+ * Forwards scan callers with equality type array scan keys are obligated to
+ * set up page state in a way that makes it possible for us to check the high
+ * key early, before we've expended too much effort on comparing tuples that
+ * cannot possibly be matches for any set of array keys. This is just an
+ * optimization.
+ *
+ * Advances the current set of array keys for SK_SEARCHARRAY scans where
+ * appropriate. These callers are required to initialize the page level high
+ * key in pstate before the first call here for the page (when the scan
+ * direction is forwards). Note that we rely on _bt_readpage calling here in
+ * page offset number order (for its scan direction). Any other order will
+ * lead to inconsistent array key state.
*
* scan: index scan descriptor (containing a search-type scankey)
+ * pstate: Page level input and output parameters
* tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
+ * finaltup: Is tuple the final one we'll be called with for this page?
*/
bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
+_bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool finaltup)
+{
+ TupleDesc tupdesc = RelationGetDescr(scan->indexRelation);
+ int natts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool res;
+ bool skrequiredtrigger;
+
+ Assert(so->qual_ok);
+ Assert(pstate->continuescan);
+ Assert(!so->needPrimScan);
+
+ res = _bt_check_compare(pstate->dir, so->keyData, so->numberOfKeys,
+ tuple, natts, tupdesc,
+ &pstate->continuescan, &skrequiredtrigger);
+
+ /*
+ * Only one _bt_check_compare call is required in the common case where
+ * there are no equality-type array scan keys.
+ *
+ * When there are array scan keys then we can still accept the first
+ * answer we get from _bt_check_compare when continuescan wasn't unset.
+ */
+ if (!so->numArrayKeys || pstate->continuescan)
+ return res;
+
+ /*
+ * _bt_check_compare set continuescan=false in the presence of equality
+ * type array keys. It's possible that we haven't reached the start of
+ * the array keys just yet. It's also possible that we need to advance
+ * the array keys now. (Or perhaps we really do need to terminate the
+ * top-level scan.)
+ */
+ pstate->continuescan = true; /* new initial assumption */
+
+ if (skrequiredtrigger && _bt_tuple_before_array_skeys(scan, pstate, tuple))
+ {
+ /*
+ * Tuple is still < the current array scan key values (as well as
+ * other equality type scan keys) if this is a forward scan.
+ * (Backwards scans reach here with a tuple > equality constraints.)
+ * We must now consider how to proceed with the ongoing primitive
+ * index scan.
+ *
+ * Should _bt_readpage continue with this page for now, in the hope of
+ * finding tuples whose key space is covered by the current array keys
+ * before too long? Or, should it give up and start a new primitive
+ * index scan instead?
+ *
+ * Our policy is to terminate the primitive index scan at the end of
+ * the current page if the current (most recently advanced) array keys
+ * don't cover the final tuple from the page. This policy is fairly
+ * conservative.
+ *
+ * Note: In some cases we're effectively speculating that the next
+ * sibling leaf page will have tuples that are covered by the key
+ * space of our array keys (the current set or some nearby set), based
+ * on a cue from the current page's final tuple. There is at least a
+ * non-zero risk of wasting a page access -- we could gamble and lose.
+ * The details of all this are handled within _bt_advance_array_keys.
+ */
+ if (finaltup || (!pstate->highkeychecked && pstate->highkey &&
+ _bt_tuple_before_array_skeys(scan, pstate,
+ pstate->highkey)))
+ {
+ /*
+ * This is the final tuple (the high key for forward scans, or the
+ * tuple at the first offset number for backward scans), but it is
+ * still before the current array keys. As such, we're unwilling
+ * to allow the current primitive index scan to continue to the
+ * next leaf page.
+ *
+ * Start a new primitive index scan. The next primitive index
+ * scan (in the next _bt_first call) is expected to reposition the
+ * scan to some much later leaf page. (If we had a good reason to
+ * think that the next leaf page that will be scanned will turn
+ * out to be close to our current position, then we wouldn't be
+ * starting another primitive index scan.)
+ *
+ * Note: _bt_readpage stashes the page high key, which allows us
+ * to make this check early (for forward scans). We thereby avoid
+ * scanning very many extra tuples on the page. This is just an
+ * optimization; skipping these useless comparisons should never
+ * change our final conclusion about what the scan should do next.
+ */
+ pstate->continuescan = false;
+ so->needPrimScan = true;
+ }
+ else if (!finaltup && pstate->highkey)
+ {
+ /*
+ * Remember that the high key has been checked with this
+ * particular set of array keys.
+ *
+ * It might make sense to check the same high key again at some
+ * point during the ongoing _bt_readpage-wise scan of this page.
+ * But it is definitely wasteful to repeat the same high key check
+ * before the array keys are advanced by some later tuple.
+ */
+ pstate->highkeychecked = true;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual
+ */
+ return false;
+ }
+
+ /*
+ * Caller's tuple is >= the current set of array keys and other equality
+ * constraint scan keys (or <= if this is a backwards scans).
+ *
+ * It might be time to advance the array keys to the next set. Try doing
+ * that now, while determining in passing if the tuple matches the newly
+ * advanced set of array keys (if we've any left).
+ *
+ * This call will also set continuescan for us (or tells us to perform
+ * another _bt_check_compare call, which then sets continuescan for us).
+ */
+ if (!_bt_advance_array_keys(scan, pstate, tuple, skrequiredtrigger))
+ {
+ /*
+ * Tuple doesn't match any later array keys, either (for one or more
+ * array type scan keys marked as required). Give up on this tuple
+ * being a match. (Call may have also terminated the primitive scan,
+ * or the top-level scan.)
+ */
+ return false;
+ }
+
+ /*
+ * Advanced array keys to values that are exact matches for corresponding
+ * attribute values from the tuple.
+ *
+ * It's fairly likely that the tuple satisfies all index scan conditions
+ * at this point, but we need confirmation of that. We also need to give
+ * _bt_check_compare a real opportunity to end the top-level index scan by
+ * setting continuescan=false. (_bt_advance_array_keys cannot deal with
+ * inequality strategy scan keys; we need _bt_check_compare for those.)
+ */
+ return _bt_check_compare(pstate->dir, so->keyData, so->numberOfKeys,
+ tuple, natts, tupdesc,
+ &pstate->continuescan, &skrequiredtrigger);
+}
+
+/*
+ * Test whether an indextuple satisfies current scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction to
+ * pass the qual with the current set of array keys.
+ *
+ * This is a subroutine for _bt_checkkeys. It is written with the assumption
+ * that reaching the end of each distinct set of array keys terminates the
+ * ongoing primitive index scan. It is up to our caller (that has more
+ * context than we have available here) to override that initial determination
+ * when it makes more sense to advance the array keys and continue with
+ * further tuples from the same leaf page.
+ */
+static bool
+_bt_check_compare(ScanDirection dir, ScanKey keyData, int keysz,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ bool *continuescan, bool *skrequiredtrigger)
{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
int ikey;
ScanKey key;
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
*continuescan = true; /* default assumption */
+ *skrequiredtrigger = true; /* default assumption */
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ for (key = keyData, ikey = 0; ikey < keysz; key++, ikey++)
{
Datum datum;
bool isNull;
@@ -1512,6 +2611,10 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* qual fails, it is critical that equality quals be used for the
* initial positioning in _bt_first() when they are available. See
* comments in _bt_first().
+ *
+ * Scans with equality-type array scan keys run into a similar
+ * problem whenever they advance the array keys. Our caller uses
+ * _bt_tuple_before_array_skeys to avoid the problem there.
*/
if ((key->sk_flags & SK_BT_REQFWD) &&
ScanDirectionIsForward(dir))
@@ -1520,6 +2623,14 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
ScanDirectionIsBackward(dir))
*continuescan = false;
+ if ((key->sk_flags & SK_SEARCHARRAY) &&
+ key->sk_strategy == BTEqualStrategyNumber)
+ {
+ if (*continuescan)
+ *skrequiredtrigger = false;
+ *continuescan = false;
+ }
+
/*
* In any case, this indextuple doesn't match the qual.
*/
@@ -1538,7 +2649,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* it's not possible for any future tuples in the current scan direction
* to pass the qual.
*
- * This is a subroutine for _bt_checkkeys, which see for more info.
+ * This is a subroutine for _bt_check_compare/_bt_checkkeys_compare.
*/
static bool
_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 6a93d767a..f04ca1ee9 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -106,8 +106,7 @@ static List *build_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexOptInfo *index, IndexClauseSet *clauses,
bool useful_predicate,
ScanTypeControl scantype,
- bool *skip_nonnative_saop,
- bool *skip_lower_saop);
+ bool *skip_nonnative_saop);
static List *build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel,
List *clauses, List *other_clauses);
static List *generate_bitmap_or_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -706,8 +705,6 @@ eclass_already_used(EquivalenceClass *parent_ec, Relids oldrelids,
* index AM supports them natively, we should just include them in simple
* index paths. If not, we should exclude them while building simple index
* paths, and then make a separate attempt to include them in bitmap paths.
- * Furthermore, we should consider excluding lower-order ScalarArrayOpExpr
- * quals so as to create ordered paths.
*/
static void
get_index_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -716,37 +713,17 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
{
List *indexpaths;
bool skip_nonnative_saop = false;
- bool skip_lower_saop = false;
ListCell *lc;
/*
* Build simple index paths using the clauses. Allow ScalarArrayOpExpr
- * clauses only if the index AM supports them natively, and skip any such
- * clauses for index columns after the first (so that we produce ordered
- * paths if possible).
+ * clauses only if the index AM supports them natively.
*/
indexpaths = build_index_paths(root, rel,
index, clauses,
index->predOK,
ST_ANYSCAN,
- &skip_nonnative_saop,
- &skip_lower_saop);
-
- /*
- * If we skipped any lower-order ScalarArrayOpExprs on an index with an AM
- * that supports them, then try again including those clauses. This will
- * produce paths with more selectivity but no ordering.
- */
- if (skip_lower_saop)
- {
- indexpaths = list_concat(indexpaths,
- build_index_paths(root, rel,
- index, clauses,
- index->predOK,
- ST_ANYSCAN,
- &skip_nonnative_saop,
- NULL));
- }
+ &skip_nonnative_saop);
/*
* Submit all the ones that can form plain IndexScan plans to add_path. (A
@@ -784,7 +761,6 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
index, clauses,
false,
ST_BITMAPSCAN,
- NULL,
NULL);
*bitindexpaths = list_concat(*bitindexpaths, indexpaths);
}
@@ -817,27 +793,19 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
* to true if we found any such clauses (caller must initialize the variable
* to false). If it's NULL, we do not ignore ScalarArrayOpExpr clauses.
*
- * If skip_lower_saop is non-NULL, we ignore ScalarArrayOpExpr clauses for
- * non-first index columns, and we set *skip_lower_saop to true if we found
- * any such clauses (caller must initialize the variable to false). If it's
- * NULL, we do not ignore non-first ScalarArrayOpExpr clauses, but they will
- * result in considering the scan's output to be unordered.
- *
* 'rel' is the index's heap relation
* 'index' is the index for which we want to generate paths
* 'clauses' is the collection of indexable clauses (IndexClause nodes)
* 'useful_predicate' indicates whether the index has a useful predicate
* 'scantype' indicates whether we need plain or bitmap scan support
* 'skip_nonnative_saop' indicates whether to accept SAOP if index AM doesn't
- * 'skip_lower_saop' indicates whether to accept non-first-column SAOP
*/
static List *
build_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexOptInfo *index, IndexClauseSet *clauses,
bool useful_predicate,
ScanTypeControl scantype,
- bool *skip_nonnative_saop,
- bool *skip_lower_saop)
+ bool *skip_nonnative_saop)
{
List *result = NIL;
IndexPath *ipath;
@@ -848,7 +816,6 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
List *orderbyclausecols;
List *index_pathkeys;
List *useful_pathkeys;
- bool found_lower_saop_clause;
bool pathkeys_possibly_useful;
bool index_is_ordered;
bool index_only_scan;
@@ -880,19 +847,11 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
* on by btree and possibly other places.) The list can be empty, if the
* index AM allows that.
*
- * found_lower_saop_clause is set true if we accept a ScalarArrayOpExpr
- * index clause for a non-first index column. This prevents us from
- * assuming that the scan result is ordered. (Actually, the result is
- * still ordered if there are equality constraints for all earlier
- * columns, but it seems too expensive and non-modular for this code to be
- * aware of that refinement.)
- *
* We also build a Relids set showing which outer rels are required by the
* selected clauses. Any lateral_relids are included in that, but not
* otherwise accounted for.
*/
index_clauses = NIL;
- found_lower_saop_clause = false;
outer_relids = bms_copy(rel->lateral_relids);
for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++)
{
@@ -917,16 +876,6 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
/* Caller had better intend this only for bitmap scan */
Assert(scantype == ST_BITMAPSCAN);
}
- if (indexcol > 0)
- {
- if (skip_lower_saop)
- {
- /* Caller doesn't want to lose index ordering */
- *skip_lower_saop = true;
- continue;
- }
- found_lower_saop_clause = true;
- }
}
/* OK to include this clause */
@@ -956,11 +905,9 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
/*
* 2. Compute pathkeys describing index's ordering, if any, then see how
* many of them are actually useful for this query. This is not relevant
- * if we are only trying to build bitmap indexscans, nor if we have to
- * assume the scan is unordered.
+ * if we are only trying to build bitmap indexscans.
*/
pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN &&
- !found_lower_saop_clause &&
has_useful_pathkeys(root, rel));
index_is_ordered = (index->sortopfamily != NULL);
if (index_is_ordered && pathkeys_possibly_useful)
@@ -1212,7 +1159,6 @@ build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel,
index, &clauseset,
useful_predicate,
ST_BITMAPSCAN,
- NULL,
NULL);
result = list_concat(result, indexpaths);
}
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076..c796b53a6 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6444,8 +6444,6 @@ genericcostestimate(PlannerInfo *root,
double numIndexTuples;
double spc_random_page_cost;
double num_sa_scans;
- double num_outer_scans;
- double num_scans;
double qual_op_cost;
double qual_arg_cost;
List *selectivityQuals;
@@ -6460,7 +6458,7 @@ genericcostestimate(PlannerInfo *root,
/*
* Check for ScalarArrayOpExpr index quals, and estimate the number of
- * index scans that will be performed.
+ * primitive index scans that will be performed for caller
*/
num_sa_scans = 1;
foreach(l, indexQuals)
@@ -6490,19 +6488,8 @@ genericcostestimate(PlannerInfo *root,
*/
numIndexTuples = costs->numIndexTuples;
if (numIndexTuples <= 0.0)
- {
numIndexTuples = indexSelectivity * index->rel->tuples;
- /*
- * The above calculation counts all the tuples visited across all
- * scans induced by ScalarArrayOpExpr nodes. We want to consider the
- * average per-indexscan number, so adjust. This is a handy place to
- * round to integer, too. (If caller supplied tuple estimate, it's
- * responsible for handling these considerations.)
- */
- numIndexTuples = rint(numIndexTuples / num_sa_scans);
- }
-
/*
* We can bound the number of tuples by the index size in any case. Also,
* always estimate at least one tuple is touched, even when
@@ -6540,27 +6527,31 @@ genericcostestimate(PlannerInfo *root,
*
* The above calculations are all per-index-scan. However, if we are in a
* nestloop inner scan, we can expect the scan to be repeated (with
- * different search keys) for each row of the outer relation. Likewise,
- * ScalarArrayOpExpr quals result in multiple index scans. This creates
- * the potential for cache effects to reduce the number of disk page
- * fetches needed. We want to estimate the average per-scan I/O cost in
- * the presence of caching.
+ * different search keys) for each row of the outer relation. This
+ * creates the potential for cache effects to reduce the number of disk
+ * page fetches needed. We want to estimate the average per-scan I/O cost
+ * in the presence of caching.
*
* We use the Mackert-Lohman formula (see costsize.c for details) to
* estimate the total number of page fetches that occur. While this
* wasn't what it was designed for, it seems a reasonable model anyway.
* Note that we are counting pages not tuples anymore, so we take N = T =
* index size, as if there were one "tuple" per page.
+ *
+ * Note: we assume that there will be no repeat index page fetches across
+ * ScalarArrayOpExpr primitive scans from the same logical index scan.
+ * This is guaranteed to be true for btree indexes, but is very optimistic
+ * with index AMs that cannot natively execute ScalarArrayOpExpr quals.
+ * However, these same index AMs also accept our default pessimistic
+ * approach to counting num_sa_scans (btree caller caps this), so we don't
+ * expect the final indexTotalCost to be wildly over-optimistic.
*/
- num_outer_scans = loop_count;
- num_scans = num_sa_scans * num_outer_scans;
-
- if (num_scans > 1)
+ if (loop_count > 1)
{
double pages_fetched;
/* total page fetches ignoring cache effects */
- pages_fetched = numIndexPages * num_scans;
+ pages_fetched = numIndexPages * loop_count;
/* use Mackert and Lohman formula to adjust for cache effects */
pages_fetched = index_pages_fetched(pages_fetched,
@@ -6570,11 +6561,9 @@ genericcostestimate(PlannerInfo *root,
/*
* Now compute the total disk access cost, and then report a pro-rated
- * share for each outer scan. (Don't pro-rate for ScalarArrayOpExpr,
- * since that's internal to the indexscan.)
+ * share for each outer scan
*/
- indexTotalCost = (pages_fetched * spc_random_page_cost)
- / num_outer_scans;
+ indexTotalCost = (pages_fetched * spc_random_page_cost) / loop_count;
}
else
{
@@ -6590,10 +6579,8 @@ genericcostestimate(PlannerInfo *root,
* evaluated once at the start of the scan to reduce them to runtime keys
* to pass to the index AM (see nodeIndexscan.c). We model the per-tuple
* CPU costs as cpu_index_tuple_cost plus one cpu_operator_cost per
- * indexqual operator. Because we have numIndexTuples as a per-scan
- * number, we have to multiply by num_sa_scans to get the correct result
- * for ScalarArrayOpExpr cases. Similarly add in costs for any index
- * ORDER BY expressions.
+ * indexqual operator. Similarly add in costs for any index ORDER BY
+ * expressions.
*
* Note: this neglects the possible costs of rechecking lossy operators.
* Detecting that that might be needed seems more expensive than it's
@@ -6606,7 +6593,7 @@ genericcostestimate(PlannerInfo *root,
indexStartupCost = qual_arg_cost;
indexTotalCost += qual_arg_cost;
- indexTotalCost += numIndexTuples * num_sa_scans * (cpu_index_tuple_cost + qual_op_cost);
+ indexTotalCost += numIndexTuples * (cpu_index_tuple_cost + qual_op_cost);
/*
* Generic assumption about index correlation: there isn't any.
@@ -6684,7 +6671,6 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
bool eqQualHere;
bool found_saop;
bool found_is_null_op;
- double num_sa_scans;
ListCell *lc;
/*
@@ -6699,17 +6685,12 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
*
* For a RowCompareExpr, we consider only the first column, just as
* rowcomparesel() does.
- *
- * If there's a ScalarArrayOpExpr in the quals, we'll actually perform N
- * index scans not one, but the ScalarArrayOpExpr's operator can be
- * considered to act the same as it normally does.
*/
indexBoundQuals = NIL;
indexcol = 0;
eqQualHere = false;
found_saop = false;
found_is_null_op = false;
- num_sa_scans = 1;
foreach(lc, path->indexclauses)
{
IndexClause *iclause = lfirst_node(IndexClause, lc);
@@ -6749,14 +6730,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
else if (IsA(clause, ScalarArrayOpExpr))
{
ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause;
- Node *other_operand = (Node *) lsecond(saop->args);
- int alength = estimate_array_length(other_operand);
clause_op = saop->opno;
found_saop = true;
- /* count number of SA scans induced by indexBoundQuals only */
- if (alength > 1)
- num_sa_scans *= alength;
}
else if (IsA(clause, NullTest))
{
@@ -6805,9 +6781,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
Selectivity btreeSelectivity;
/*
- * If the index is partial, AND the index predicate with the
- * index-bound quals to produce a more accurate idea of the number of
- * rows covered by the bound conditions.
+ * AND the index predicate with the index-bound quals to produce a
+ * more accurate idea of the number of rows covered by the bound
+ * conditions
*/
selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
@@ -6816,13 +6792,6 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
JOIN_INNER,
NULL);
numIndexTuples = btreeSelectivity * index->rel->tuples;
-
- /*
- * As in genericcostestimate(), we have to adjust for any
- * ScalarArrayOpExpr quals included in indexBoundQuals, and then round
- * to integer.
- */
- numIndexTuples = rint(numIndexTuples / num_sa_scans);
}
/*
@@ -6832,6 +6801,43 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
genericcostestimate(root, path, loop_count, &costs);
+ /*
+ * Now compensate for btree's ability to efficiently execute scans with
+ * SAOP clauses.
+ *
+ * btree automatically combines individual ScalarArrayOpExpr primitive
+ * index scans whenever the tuples covered by the next set of array keys
+ * are close to tuples covered by the current set. This makes the final
+ * number of descents particularly difficult to estimate. However, btree
+ * scans never visit any single leaf page more than once. That puts a
+ * natural floor under the worst case number of descents.
+ *
+ * It's particularly important that we not wildly overestimate the number
+ * of descents needed for a clause list with several SAOPs -- the costs
+ * really aren't multiplicative in the way genericcostestimate expects. In
+ * general, most distinct combinations of SAOP keys will tend to not find
+ * any matching tuples. Furthermore, btree scans search for the next set
+ * of array keys using the next tuple in line, and so won't even need a
+ * direct comparison to eliminate most non-matching sets of array keys.
+ *
+ * Clamp the number of descents to the estimated number of leaf page
+ * visits. This is still fairly pessimistic, but tends to result in more
+ * accurate costing of scans with several SAOP clauses -- especially when
+ * each array has more than a few elements. The cost of adding additional
+ * array constants to a low-order SAOP column should saturate past a
+ * certain point (except where selectivity estimates continue to shift).
+ *
+ * Also clamp the number of descents to 1/3 the number of index pages.
+ * This avoids implausibly high estimates with low selectivity paths,
+ * where scans frequently require no more than one or two descents.
+ */
+ if (costs.num_sa_scans > 1)
+ {
+ costs.num_sa_scans = Min(costs.num_sa_scans, costs.numIndexPages);
+ costs.num_sa_scans = Min(costs.num_sa_scans, index->pages / 3);
+ costs.num_sa_scans = Max(costs.num_sa_scans, 1);
+ }
+
/*
* Add a CPU-cost component to represent the costs of initial btree
* descent. We don't charge any I/O cost for touching upper btree levels,
@@ -6839,9 +6845,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* comparisons to descend a btree of N leaf tuples. We charge one
* cpu_operator_cost per comparison.
*
- * If there are ScalarArrayOpExprs, charge this once per SA scan. The
- * ones after the first one are not startup cost so far as the overall
- * plan is concerned, so add them only to "total" cost.
+ * If there are ScalarArrayOpExprs, charge this once per estimated
+ * primitive SA scan. The ones after the first one are not startup cost
+ * so far as the overall plan goes, so just add them to "total" cost.
*/
if (index->tuples > 1) /* avoid computing log(0) */
{
@@ -6858,7 +6864,8 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* in cases where only a single leaf page is expected to be visited. This
* cost is somewhat arbitrarily set at 50x cpu_operator_cost per page
* touched. The number of such pages is btree tree height plus one (ie,
- * we charge for the leaf page too). As above, charge once per SA scan.
+ * we charge for the leaf page too). As above, charge once per estimated
+ * primitive SA scan.
*/
descentCost = (index->tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER * cpu_operator_cost;
costs.indexStartupCost += descentCost;
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9c4930e9a..a431a7543 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4005,6 +4005,19 @@ description | Waiting for a newly initialized WAL file to reach durable storage
</para>
</note>
+ <note>
+ <para>
+ Every time an index is searched, the index's
+ <structname>pg_stat_all_indexes</structname>.<structfield>idx_scan</structfield>
+ field is incremented. This usually happens once per index scan node
+ execution, but might take place several times during execution of a scan
+ that searches for multiple values together. Only queries that use certain
+ <acronym>SQL</acronym> constructs to search for rows matching any value
+ out of a list (or an array) of multiple scalar values are affected. See
+ <xref linkend="functions-comparisons"/> for details.
+ </para>
+ </note>
+
</sect2>
<sect2 id="monitoring-pg-statio-all-tables-view">
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index acfd9d1f4..84c068ae3 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1910,7 +1910,7 @@ SELECT count(*) FROM dupindexcols
(1 row)
--
--- Check ordering of =ANY indexqual results (bug in 9.2.0)
+-- Check that index scans with =ANY indexquals return rows in index order
--
explain (costs off)
SELECT unique1 FROM tenk1
@@ -1936,12 +1936,11 @@ explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
- QUERY PLAN
--------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------------------
Index Only Scan using tenk1_thous_tenthous on tenk1
- Index Cond: (thousand < 2)
- Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
-(3 rows)
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -1952,18 +1951,35 @@ ORDER BY thousand;
1 | 1001
(2 rows)
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Only Scan Backward using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ thousand | tenthous
+----------+----------
+ 1 | 1001
+ 0 | 3000
+(2 rows)
+
SET enable_indexonlyscan = OFF;
explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
- QUERY PLAN
---------------------------------------------------------------------------------------
- Sort
- Sort Key: thousand
- -> Index Scan using tenk1_thous_tenthous on tenk1
- Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
-(4 rows)
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -1974,6 +1990,25 @@ ORDER BY thousand;
1 | 1001
(2 rows)
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan Backward using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ thousand | tenthous
+----------+----------
+ 1 | 1001
+ 0 | 3000
+(2 rows)
+
RESET enable_indexonlyscan;
--
-- Check elimination of constant-NULL subexpressions
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 9b8638f28..20b69ff87 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -7797,10 +7797,9 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 >= any (array[1,5]);
Merge Cond: (j1.id1 = j2.id1)
Join Filter: (j2.id2 = j1.id2)
-> Index Scan using j1_id1_idx on j1
- -> Index Only Scan using j2_pkey on j2
+ -> Index Scan using j2_id1_idx on j2
Index Cond: (id1 >= ANY ('{1,5}'::integer[]))
- Filter: ((id1 % 1000) = 1)
-(7 rows)
+(6 rows)
select * from j1
inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index d49ce9f30..41b955a27 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -753,7 +753,7 @@ SELECT count(*) FROM dupindexcols
WHERE f1 BETWEEN 'WA' AND 'ZZZ' and id < 1000 and f1 ~<~ 'YX';
--
--- Check ordering of =ANY indexqual results (bug in 9.2.0)
+-- Check that index scans with =ANY indexquals return rows in index order
--
explain (costs off)
@@ -774,6 +774,15 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
SET enable_indexonlyscan = OFF;
explain (costs off)
@@ -785,6 +794,15 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
RESET enable_indexonlyscan;
--
--
2.40.1
On Thu, Sep 28, 2023 at 5:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Sun, Sep 17, 2023 at 4:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v2, which makes all array key advancement take place using
the "next index tuple" approach (using binary searches to find array
keys using index tuple values).Attached is v3, which fixes bitrot caused by today's bugfix commit 714780dc.
Attached is v4, which applies cleanly on top of HEAD. This was needed
due to Alexandar Korotkov's commit e0b1ee17, "Skip checking of scan
keys required for directional scan in B-tree".
Unfortunately I have more or less dealt with the conflicts on HEAD by
disabling the optimization from that commit, for the time being. The
commit in question is rather poorly documented, and it's not
immediately clear how to integrate it with my work. I just want to
make sure that there's a testable patch available.
--
Peter Geoghegan
Attachments:
v4-0001-Enhance-nbtree-ScalarArrayOp-execution.patchapplication/octet-stream; name=v4-0001-Enhance-nbtree-ScalarArrayOp-execution.patchDownload
From 98beda9b64d9258b9886e5f1428abd69527dad2f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 17 Jun 2023 17:03:36 -0700
Subject: [PATCH v4] Enhance nbtree ScalarArrayOp execution.
Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively. This works by pushing additional context about the arrays
down into the nbtree index AM, as index quals. This information enabled
nbtree to execute multiple primitive index scans as part of an index
scan executor node that was treated as one continuous index scan.
The motivation behind this earlier work was enabling index-only scans
with ScalarArrayOpExpr clauses (SAOP quals are traditionally executed
via BitmapOr nodes, which is largely index-AM-agnostic, but always
requires heap access). The general idea of giving the index AM this
additional context can be pushed a lot further, though.
Teach nbtree SAOP index scans to dynamically advance array scan keys
using information about the characteristics of the index, determined at
runtime. The array key state machine advances the current array keys
using the next index tuple in line to be scanned, at the point where the
scan reaches the end of the last set of array keys. This approach is
far more flexible, and can be far more efficient. Cases that previously
required hundreds (even thousands) of primitive index scans now require
as few as one single primitive index scan.
Also remove all restrictions on generating path keys for nbtree index
scans that happen to have ScalarArrayOpExpr quals. Bugfix commit
807a40c5 taught the planner to avoid generating unsafe path keys: path
keys on a multicolumn index path, with a SAOP clause on any attribute
beyond the first/most significant attribute. These cases are now safe.
Now nbtree index scans with an inequality clause on a high order column
and a SAOP clause on a lower order column are executed as one single
primitive index scan, since that is the most efficient way to do it.
Non-required equality type SAOP quals are executed by nbtree using
almost the same approach used for required equality type SAOP quals.
nbtree is now strictly guaranteed to avoid all repeat accesses to any
individual leaf page, even in cases with inequalities on high order
columns (except when the scan direction changes, or the scan restarts).
We now have strong guarantees about the worst case, which is very useful
when costing index scans with SAOP clauses. The cost profile of index
paths with multiple SAOP clauses is now a lot closer to other cases;
more selective index scans will now generally have lower costs than less
selective index scans. The added cost from repeatedly descending the
index still matters, but it can never dominate.
An important goal of this work is to remove all ScalarArrayOpExpr clause
special cases from the planner -- ScalarArrayOpExpr clauses can now be
thought of a generalization of simple equality clauses (except when
costing index scans, perhaps). The planner no longer needs to generate
alternative index paths with filter quals/qpquals. We assume that true
SAOP index quals are strictly better than filter/qpquals, since the work
in nbtree guarantees that they'll be at least slightly faster.
Many of the queries sped up by the work from this commit don't directly
benefit from the nbtree/executor enhancements. They benefit indirectly.
The planner no longer shows any restraint around making SAOP clauses
into true nbtree index quals, which tends to result in significant
savings on heap page accesses. In general we never need visibility
checks to evaluate true index quals, whereas filter quals often need to
perform extra heap accesses, just to eliminate non-matching tuples
(expression evaluation is only safe with known visible tuples).
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
---
src/include/access/nbtree.h | 39 +-
src/backend/access/nbtree/nbtree.c | 65 +-
src/backend/access/nbtree/nbtsearch.c | 95 +-
src/backend/access/nbtree/nbtutils.c | 1386 ++++++++++++++++++--
src/backend/optimizer/path/indxpath.c | 64 +-
src/backend/utils/adt/selfuncs.c | 123 +-
doc/src/sgml/monitoring.sgml | 13 +
src/test/regress/expected/create_index.out | 61 +-
src/test/regress/expected/join.out | 5 +-
src/test/regress/sql/create_index.sql | 20 +-
10 files changed, 1516 insertions(+), 355 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7bfbf3086..de7dea41c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1043,13 +1043,13 @@ typedef struct BTScanOpaqueData
/* workspace for SK_SEARCHARRAY support */
ScanKey arrayKeyData; /* modified copy of scan->keyData */
- bool arraysStarted; /* Started array keys, but have yet to "reach
- * past the end" of all arrays? */
int numArrayKeys; /* number of equality-type array keys (-1 if
* there are any unsatisfiable array keys) */
- int arrayKeyCount; /* count indicating number of array scan keys
- * processed */
+ bool needPrimScan; /* Perform another primitive scan? */
BTArrayKeyInfo *arrayKeys; /* info about each equality-type array key */
+ FmgrInfo *orderProcs; /* ORDER procs for equality constraint keys */
+ int numPrimScans; /* Running tally of # primitive index scans
+ * (used to coordinate parallel workers) */
MemoryContext arrayContext; /* scan-lifespan context for array data */
/* info about killed items if any (killedItems is NULL if never used) */
@@ -1083,6 +1083,29 @@ typedef struct BTScanOpaqueData
typedef BTScanOpaqueData *BTScanOpaque;
+/*
+ * _bt_readpage state used across _bt_checkkeys calls for a page
+ *
+ * When _bt_readpage is called during a forward scan that has one or more
+ * equality-type SK_SEARCHARRAY scan keys, it has an extra responsibility: to
+ * set up information about the page high key. This must happen before the
+ * first call to _bt_checkkeys. _bt_checkkeys uses this information to manage
+ * advancement of the scan's array keys.
+ */
+typedef struct BTReadPageState
+{
+ /* Input parameters, set by _bt_readpage */
+ ScanDirection dir; /* current scan direction */
+ IndexTuple highkey; /* page high key, set by forward scans */
+
+ /* Output parameters, set by _bt_checkkeys */
+ bool continuescan; /* Terminate ongoing (primitive) index scan? */
+
+ /* Private _bt_checkkeys-managed state */
+ bool highkeychecked; /* high key checked against current
+ * SK_SEARCHARRAY array keys? */
+} BTReadPageState;
+
/*
* We use some private sk_flags bits in preprocessed scan keys. We're allowed
* to use bits 16-31 (see skey.h). The uppermost bits are copied from the
@@ -1160,7 +1183,7 @@ extern bool btcanreturn(Relation index, int attno);
extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno);
extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
extern void _bt_parallel_done(IndexScanDesc scan);
-extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
+extern void _bt_parallel_next_primitive_scan(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
@@ -1253,12 +1276,12 @@ extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan,
+extern bool _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool finaltup,
bool requiredMatchedByPrecheck);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 92950b377..2a463c420 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -48,8 +48,8 @@
* BTPARALLEL_IDLE indicates that no backend is currently advancing the scan
* to a new page; some process can start doing that.
*
- * BTPARALLEL_DONE indicates that the scan is complete (including error exit).
- * We reach this state once for every distinct combination of array keys.
+ * BTPARALLEL_DONE indicates that the primitive index scan is complete
+ * (including error exit). Reached once per primitive index scan.
*/
typedef enum
{
@@ -69,8 +69,8 @@ typedef struct BTParallelScanDescData
BTPS_State btps_pageStatus; /* indicates whether next page is
* available for scan. see above for
* possible states of parallel scan. */
- int btps_arrayKeyCount; /* count indicating number of array scan
- * keys processed by parallel scan */
+ int btps_numPrimScans; /* count indicating number of primitive
+ * index scans (used with array keys) */
slock_t btps_mutex; /* protects above variables */
ConditionVariable btps_cv; /* used to synchronize parallel scan */
} BTParallelScanDescData;
@@ -276,7 +276,7 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
if (res)
break;
/* ... otherwise see if we have more array keys to deal with */
- } while (so->numArrayKeys && _bt_advance_array_keys(scan, dir));
+ } while (so->numArrayKeys && _bt_array_keys_remain(scan, dir));
return res;
}
@@ -334,7 +334,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
}
}
/* Now see if we have more array keys to deal with */
- } while (so->numArrayKeys && _bt_advance_array_keys(scan, ForwardScanDirection));
+ } while (so->numArrayKeys && _bt_array_keys_remain(scan, ForwardScanDirection));
return ntids;
}
@@ -364,9 +364,10 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->keyData = NULL;
so->arrayKeyData = NULL; /* assume no array keys for now */
- so->arraysStarted = false;
so->numArrayKeys = 0;
+ so->needPrimScan = false;
so->arrayKeys = NULL;
+ so->orderProcs = NULL;
so->arrayContext = NULL;
so->killedItems = NULL; /* until needed */
@@ -406,7 +407,8 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
}
so->markItemIndex = -1;
- so->arrayKeyCount = 0;
+ so->needPrimScan = false;
+ so->numPrimScans = 0;
so->firstPage = false;
BTScanPosUnpinIfPinned(so->markPos);
BTScanPosInvalidate(so->markPos);
@@ -588,7 +590,7 @@ btinitparallelscan(void *target)
SpinLockInit(&bt_target->btps_mutex);
bt_target->btps_scanPage = InvalidBlockNumber;
bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- bt_target->btps_arrayKeyCount = 0;
+ bt_target->btps_numPrimScans = 0;
ConditionVariableInit(&bt_target->btps_cv);
}
@@ -614,7 +616,7 @@ btparallelrescan(IndexScanDesc scan)
SpinLockAcquire(&btscan->btps_mutex);
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- btscan->btps_arrayKeyCount = 0;
+ btscan->btps_numPrimScans = 0;
SpinLockRelease(&btscan->btps_mutex);
}
@@ -625,7 +627,17 @@ btparallelrescan(IndexScanDesc scan)
*
* The return value is true if we successfully seized the scan and false
* if we did not. The latter case occurs if no pages remain for the current
- * set of scankeys.
+ * primitive index scan.
+ *
+ * When array scan keys are in use, each worker process independently advances
+ * its array keys. It's crucial that each worker process never be allowed to
+ * scan a page from before the current scan position.
+ *
+ * XXX This particular aspect of the patch is still at the proof of concept
+ * stage. Having this much available for review at least suggests that it'll
+ * be feasible to port the existing parallel scan array scan key stuff over to
+ * using a primitive index scan counter (as opposed to an array key counter)
+ * the top-level scan. I have yet to really put this code through its paces.
*
* If the return value is true, *pageno returns the next or current page
* of the scan (depending on the scan direction). An invalid block number
@@ -656,16 +668,17 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
SpinLockAcquire(&btscan->btps_mutex);
pageStatus = btscan->btps_pageStatus;
- if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+ if (so->numPrimScans < btscan->btps_numPrimScans)
{
- /* Parallel scan has already advanced to a new set of scankeys. */
+ /* Top-level scan already moved on to next primitive index scan */
status = false;
}
else if (pageStatus == BTPARALLEL_DONE)
{
/*
- * We're done with this set of scankeys. This may be the end, or
- * there could be more sets to try.
+ * We're done with this primitive index scan. This might have
+ * been the final primitive index scan required, or the top-level
+ * index scan might require additional primitive scans.
*/
status = false;
}
@@ -697,9 +710,12 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
void
_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
{
+ BTScanOpaque so PG_USED_FOR_ASSERTS_ONLY = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
+ Assert(!so->needPrimScan);
+
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
@@ -733,12 +749,11 @@ _bt_parallel_done(IndexScanDesc scan)
parallel_scan->ps_offset);
/*
- * Mark the parallel scan as done for this combination of scan keys,
- * unless some other process already did so. See also
- * _bt_advance_array_keys.
+ * Mark the primitive index scan as done, unless some other process
+ * already did so. See also _bt_array_keys_remain.
*/
SpinLockAcquire(&btscan->btps_mutex);
- if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
+ if (so->numPrimScans >= btscan->btps_numPrimScans &&
btscan->btps_pageStatus != BTPARALLEL_DONE)
{
btscan->btps_pageStatus = BTPARALLEL_DONE;
@@ -752,14 +767,14 @@ _bt_parallel_done(IndexScanDesc scan)
}
/*
- * _bt_parallel_advance_array_keys() -- Advances the parallel scan for array
- * keys.
+ * _bt_parallel_next_primitive_scan() -- Advances parallel primitive scan
+ * counter when array keys are in use.
*
- * Updates the count of array keys processed for both local and parallel
+ * Updates the count of primitive index scans for both local and parallel
* scans.
*/
void
-_bt_parallel_advance_array_keys(IndexScanDesc scan)
+_bt_parallel_next_primitive_scan(IndexScanDesc scan)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
@@ -768,13 +783,13 @@ _bt_parallel_advance_array_keys(IndexScanDesc scan)
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
- so->arrayKeyCount++;
+ so->numPrimScans++;
SpinLockAcquire(&btscan->btps_mutex);
if (btscan->btps_pageStatus == BTPARALLEL_DONE)
{
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- btscan->btps_arrayKeyCount++;
+ btscan->btps_numPrimScans++;
}
SpinLockRelease(&btscan->btps_mutex);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index efc5284e5..a2fc9c691 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -893,7 +893,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
if (!so->qual_ok)
{
- /* Notify any other workers that we're done with this scan key. */
+ /* Notify any other workers that this primitive scan is done */
_bt_parallel_done(scan);
return false;
}
@@ -952,6 +952,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* one we use --- by definition, they are either redundant or
* contradictory.
*
+ * When SK_SEARCHARRAY keys are in use, _bt_tuple_before_array_keys is
+ * used to avoid prematurely stopping the scan when an array equality qual
+ * has its array keys advanced.
+ *
* Any regular (not SK_SEARCHNULL) key implies a NOT NULL qualifier.
* If the index stores nulls at the end of the index we'll be starting
* from, and we have no boundary key for the column (which means the key
@@ -1537,10 +1541,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
BTPageOpaque opaque;
OffsetNumber minoff;
OffsetNumber maxoff;
+ BTReadPageState pstate;
int itemIndex;
- bool continuescan;
- int indnatts;
- bool requiredMatchedByPrecheck;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1560,8 +1562,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
}
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ pstate.dir = dir;
+ pstate.highkey = NULL;
+ pstate.continuescan = true; /* default assumption */
+ pstate.highkeychecked = false;
+
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
@@ -1613,29 +1618,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
*/
if (!so->firstPage && minoff < maxoff)
{
- ItemId iid;
- IndexTuple itup;
-
- iid = PageGetItemId(page, ScanDirectionIsForward(dir) ? maxoff : minoff);
- itup = (IndexTuple) PageGetItem(page, iid);
-
/*
* Do the precheck. Note that we pass the pointer to
* 'requiredMatchedByPrecheck' to 'continuescan' argument. That will
* set flag to true if all required keys are satisfied and false
* otherwise.
+ *
+ * XXX FIXME
*/
- (void) _bt_checkkeys(scan, itup, indnatts, dir,
- &requiredMatchedByPrecheck, false);
}
else
{
so->firstPage = false;
- requiredMatchedByPrecheck = false;
}
if (ScanDirectionIsForward(dir))
{
+ /* SK_SEARCHARRAY scans must provide high key up front */
+ if (so->numArrayKeys && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+
+ pstate.highkey = (IndexTuple) PageGetItem(page, iid);
+ }
+
/* load items[] in ascending order */
itemIndex = 0;
@@ -1645,7 +1651,6 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple itup;
- bool passes_quals;
/*
* If the scan specifies not to return killed tuples, then we
@@ -1659,18 +1664,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, iid);
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan, requiredMatchedByPrecheck);
-
- /*
- * If the result of prechecking required keys was true, then in
- * assert-enabled builds we also recheck that the _bt_checkkeys()
- * result is the same.
- */
- Assert(!requiredMatchedByPrecheck ||
- passes_quals == _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan, false));
- if (passes_quals)
+ if (_bt_checkkeys(scan, &pstate, itup, false, false))
{
/* tuple passes all scan key conditions */
if (!BTreeTupleIsPosting(itup))
@@ -1703,7 +1697,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
}
}
/* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
+ if (!pstate.continuescan)
break;
offnum = OffsetNumberNext(offnum);
@@ -1720,17 +1714,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* only appear on non-pivot tuples on the right sibling page are
* common.
*/
- if (continuescan && !P_RIGHTMOST(opaque))
+ if (pstate.continuescan && !P_RIGHTMOST(opaque))
{
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
+ IndexTuple itup;
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan, false);
+ if (pstate.highkey)
+ itup = pstate.highkey;
+ else
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+ }
+
+ _bt_checkkeys(scan, &pstate, itup, true, false);
}
- if (!continuescan)
+ if (!pstate.continuescan)
so->currPos.moreRight = false;
Assert(itemIndex <= MaxTIDsPerBTreePage);
@@ -1751,6 +1751,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
IndexTuple itup;
bool tuple_alive;
bool passes_quals;
+ bool finaltup = (offnum == minoff);
/*
* If the scan specifies not to return killed tuples, then we
@@ -1761,12 +1762,18 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* tuple on the page, we do check the index keys, to prevent
* uselessly advancing to the page to the left. This is similar
* to the high key optimization used by forward scans.
+ *
+ * Separately, _bt_checkkeys actually requires that we call it
+ * with the final non-pivot tuple from the page, if there's one
+ * (final processed tuple, or first tuple in offset number terms).
+ * We must indicate which particular tuple comes last, too.
*/
if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
{
Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
+ if (!finaltup)
{
+ Assert(offnum > minoff);
offnum = OffsetNumberPrev(offnum);
continue;
}
@@ -1778,17 +1785,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, iid);
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan, requiredMatchedByPrecheck);
-
- /*
- * If the result of prechecking required keys was true, then in
- * assert-enabled builds we also recheck that the _bt_checkkeys()
- * result is the same.
- */
- Assert(!requiredMatchedByPrecheck ||
- passes_quals == _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan, false));
+ passes_quals = _bt_checkkeys(scan, &pstate, itup, finaltup, false);
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions */
@@ -1827,7 +1824,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
}
}
}
- if (!continuescan)
+ if (!pstate.continuescan)
{
/* there can't be any more matches, so stop */
so->currPos.moreLeft = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1510b97fb..38d4ec463 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -33,7 +33,7 @@
typedef struct BTSortArrayContext
{
- FmgrInfo flinfo;
+ FmgrInfo *orderproc;
Oid collation;
bool reverse;
} BTSortArrayContext;
@@ -41,15 +41,34 @@ typedef struct BTSortArrayContext
static Datum _bt_find_extreme_element(IndexScanDesc scan, ScanKey skey,
StrategyNumber strat,
Datum *elems, int nelems);
+static void _bt_sort_cmp_func_setup(IndexScanDesc scan, ScanKey skey);
static int _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
bool reverse,
Datum *elems, int nelems);
static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
+static inline int32 _bt_compare_array_skey(ScanKey cur, FmgrInfo *orderproc,
+ Datum datum, bool null,
+ Datum arrdatum);
+static int _bt_binsrch_array_skey(ScanDirection dir, bool cur_elem_start,
+ BTArrayKeyInfo *array, ScanKey cur,
+ FmgrInfo *orderproc, Datum datum, bool null,
+ int32 *final_result);
+static bool _bt_tuple_before_array_skeys(IndexScanDesc scan,
+ BTReadPageState *pstate,
+ IndexTuple tuple);
+static bool _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool skrequiredtrigger);
+static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir);
+static void _bt_advance_array_keys_to_end(IndexScanDesc scan, ScanDirection dir);
static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
ScanKey leftarg, ScanKey rightarg,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
+static bool _bt_check_compare(ScanDirection dir, ScanKey keyData, int keysz,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ bool *continuescan, bool *skrequiredtrigger,
+ bool requiredMatchedByPrecheck);
static bool _bt_check_rowcompare(ScanKey skey,
IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
ScanDirection dir, bool *continuescan);
@@ -202,6 +221,21 @@ _bt_freestack(BTStack stack)
* array keys, it's sufficient to find the extreme element value and replace
* the whole array with that scalar value.
*
+ * In the worst case, the number of primitive index scans will equal the
+ * number of array elements (or the product of the number of array keys when
+ * there are multiple arrays/columns involved). It's also possible that the
+ * total number of primitive index scans will be far less than that.
+ *
+ * We always sort and deduplicate arrays up-front for equality array keys.
+ * ScalarArrayOpExpr execution need only visit leaf pages that might contain
+ * matches exactly once, while preserving the sort order of the index. This
+ * isn't just about performance; it also avoids needing duplicate elimination
+ * of matching TIDs (we prefer deduplicating search keys once, up-front).
+ * Equality SK_SEARCHARRAY keys are disjuncts that we always process in
+ * index/key space order, which makes this general approach feasible. Every
+ * index tuple will match no more than one single distinct combination of
+ * equality-constrained keys (array keys and other equality keys).
+ *
* Note: the reason we need so->arrayKeyData, rather than just scribbling
* on scan->keyData, is that callers are permitted to call btrescan without
* supplying a new set of scankey data.
@@ -212,6 +246,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
int numberOfKeys = scan->numberOfKeys;
int16 *indoption = scan->indexRelation->rd_indoption;
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(scan->indexRelation);
int numArrayKeys;
ScanKey cur;
int i;
@@ -265,6 +300,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
/* Allocate space for per-array data in the workspace context */
so->arrayKeys = (BTArrayKeyInfo *) palloc0(numArrayKeys * sizeof(BTArrayKeyInfo));
+ so->orderProcs = (FmgrInfo *) palloc(nkeyatts * sizeof(FmgrInfo));
/* Now process each array key */
numArrayKeys = 0;
@@ -281,6 +317,17 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
int j;
cur = &so->arrayKeyData[i];
+
+ /*
+ * Attributes with equality-type scan keys (including but not limited
+ * to array scan keys) will need a 3-way comparison function.
+ *
+ * XXX Clean this up some more. This repeats some of the same work
+ * when there are multiple scan keys for the same key column.
+ */
+ if (cur->sk_strategy == BTEqualStrategyNumber)
+ _bt_sort_cmp_func_setup(scan, cur);
+
if (!(cur->sk_flags & SK_SEARCHARRAY))
continue;
@@ -436,6 +483,42 @@ _bt_find_extreme_element(IndexScanDesc scan, ScanKey skey,
return result;
}
+/*
+ * Look up the appropriate comparison function in the opfamily.
+ *
+ * Note: it's possible that this would fail, if the opfamily is incomplete,
+ * but it seems quite unlikely that an opfamily would omit non-cross-type
+ * support functions for any datatype that it supports at all.
+ */
+static void
+_bt_sort_cmp_func_setup(IndexScanDesc scan, ScanKey skey)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ Oid elemtype;
+ RegProcedure cmp_proc;
+ FmgrInfo *orderproc = &so->orderProcs[skey->sk_attno - 1];
+
+ /*
+ * Determine the nominal datatype of the array elements. We have to
+ * support the convention that sk_subtype == InvalidOid means the opclass
+ * input type; this is a hack to simplify life for ScanKeyInit().
+ */
+ elemtype = skey->sk_subtype;
+ if (elemtype == InvalidOid)
+ elemtype = rel->rd_opcintype[skey->sk_attno - 1];
+
+ cmp_proc = get_opfamily_proc(rel->rd_opfamily[skey->sk_attno - 1],
+ rel->rd_opcintype[skey->sk_attno - 1],
+ elemtype,
+ BTORDER_PROC);
+ if (!RegProcedureIsValid(cmp_proc))
+ elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
+ BTORDER_PROC, elemtype, elemtype,
+ rel->rd_opfamily[skey->sk_attno - 1]);
+ fmgr_info_cxt(cmp_proc, orderproc, so->arrayContext);
+}
+
/*
* _bt_sort_array_elements() -- sort and de-dup array elements
*
@@ -450,42 +533,14 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
bool reverse,
Datum *elems, int nelems)
{
- Relation rel = scan->indexRelation;
- Oid elemtype;
- RegProcedure cmp_proc;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
BTSortArrayContext cxt;
if (nelems <= 1)
return nelems; /* no work to do */
- /*
- * Determine the nominal datatype of the array elements. We have to
- * support the convention that sk_subtype == InvalidOid means the opclass
- * input type; this is a hack to simplify life for ScanKeyInit().
- */
- elemtype = skey->sk_subtype;
- if (elemtype == InvalidOid)
- elemtype = rel->rd_opcintype[skey->sk_attno - 1];
-
- /*
- * Look up the appropriate comparison function in the opfamily.
- *
- * Note: it's possible that this would fail, if the opfamily is
- * incomplete, but it seems quite unlikely that an opfamily would omit
- * non-cross-type support functions for any datatype that it supports at
- * all.
- */
- cmp_proc = get_opfamily_proc(rel->rd_opfamily[skey->sk_attno - 1],
- elemtype,
- elemtype,
- BTORDER_PROC);
- if (!RegProcedureIsValid(cmp_proc))
- elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
- BTORDER_PROC, elemtype, elemtype,
- rel->rd_opfamily[skey->sk_attno - 1]);
-
/* Sort the array elements */
- fmgr_info(cmp_proc, &cxt.flinfo);
+ cxt.orderproc = &so->orderProcs[skey->sk_attno - 1];
cxt.collation = skey->sk_collation;
cxt.reverse = reverse;
qsort_arg(elems, nelems, sizeof(Datum),
@@ -507,7 +562,7 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
BTSortArrayContext *cxt = (BTSortArrayContext *) arg;
int32 compare;
- compare = DatumGetInt32(FunctionCall2Coll(&cxt->flinfo,
+ compare = DatumGetInt32(FunctionCall2Coll(cxt->orderproc,
cxt->collation,
da, db));
if (cxt->reverse)
@@ -515,6 +570,171 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
return compare;
}
+/*
+ * Comparator uses to search for the next array element when array keys need
+ * to be advanced via one or more binary searches
+ *
+ * This code is loosely based on _bt_compare. However, there are some
+ * important differences.
+ *
+ * It is convenient to think of calling _bt_compare as comparing caller's
+ * insertion scankey to an index tuple. But our callers are not searching
+ * through the index at all -- they're searching through a local array of
+ * datums associated with a scan key (using values they've taken from an index
+ * tuple). This is a complete reversal of how things usually work, which can
+ * be confusing.
+ *
+ * Callers of this function should think of it as comparing "datum" (as well
+ * as "null") to "arrdatum". This is the same approach that _bt_compare takes
+ * in that both functions compare the value that they're searching for to one
+ * particular item used as a binary search pivot. (But it's the wrong way
+ * around if you think of it as "tuple values vs scan key values". So don't.)
+*/
+static inline int32
+_bt_compare_array_skey(ScanKey cur,
+ FmgrInfo *orderproc,
+ Datum datum,
+ bool null,
+ Datum arrdatum)
+{
+ int32 result = 0;
+
+ Assert(cur->sk_strategy == BTEqualStrategyNumber);
+
+ if (cur->sk_flags & SK_ISNULL) /* array/scan key is NULL */
+ {
+ if (null)
+ result = 0; /* NULL "=" NULL */
+ else if (cur->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NULL "<" NOT_NULL */
+ else
+ result = -1; /* NULL ">" NOT_NULL */
+ }
+ else if (null) /* array/scan key is NOT_NULL and tuple item
+ * is NULL */
+ {
+ if (cur->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NOT_NULL ">" NULL */
+ else
+ result = 1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * Like _bt_compare, we need to be careful of cross-type comparisons,
+ * so the left value has to be the value that came from an index
+ * tuple. (Array scan keys cannot be cross-type, but other required
+ * scan keys that use an equal operator can be.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(orderproc, cur->sk_collation,
+ datum, arrdatum));
+
+ /*
+ * Unlike _bt_compare, we flip the sign when column is a DESC column
+ * (and *not* when column is ASC). This matches the approach taken by
+ * _bt_check_rowcompare, which performs similar three-way comparisons.
+ */
+ if (cur->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ return result;
+}
+
+/*
+ * _bt_binsrch_array_skey() -- Binary search for next matching array key
+ *
+ * cur_elem_start indicates if the binary search should begin at the array's
+ * current element (or have the current element as an upper bound if it's a
+ * backward scan). This allows searches against required scan key arrays to
+ * reuse the work of earlier searches, at least in many important cases.
+ * Array keys covering key space that the index scan already processed cannot
+ * possibly contain any matches.
+ *
+ * XXX There are several fairly obvious optimizations that we could apply here
+ * (e.g., precheck searches for earlier subsets of a larger array would help).
+ * Revisit this during the next round of performance validation.
+ *
+ * Returns an index to the first array element >= caller's datum argument.
+ * Also sets *final_result to whatever _bt_compare_array_skey returned when we
+ * directly compared the returned array element to searched-for datum.
+ */
+static int
+_bt_binsrch_array_skey(ScanDirection dir, bool cur_elem_start,
+ BTArrayKeyInfo *array, ScanKey cur,
+ FmgrInfo *orderproc, Datum datum, bool null,
+ int32 *final_result)
+{
+ int low_elem,
+ high_elem,
+ first_elem_dir,
+ result = 0;
+ bool knownequal = false;
+
+ Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(cur->sk_strategy == BTEqualStrategyNumber);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ first_elem_dir = 0;
+ low_elem = array->cur_elem;
+ high_elem = array->num_elems - 1;
+ if (cur_elem_start)
+ low_elem = 0;
+ }
+ else
+ {
+ first_elem_dir = array->num_elems - 1;
+ low_elem = 0;
+ high_elem = array->cur_elem;
+ if (cur_elem_start)
+ {
+ low_elem = 0;
+ high_elem = first_elem_dir;
+ }
+ }
+
+ while (high_elem > low_elem)
+ {
+ int mid_elem = low_elem + ((high_elem - low_elem) / 2);
+ Datum arrdatum = array->elem_values[mid_elem];
+
+ result = _bt_compare_array_skey(cur, orderproc, datum, null, arrdatum);
+
+ if (result == 0)
+ {
+ /*
+ * Each array was deduplicated during initial preprocessing, so
+ * there each element is guaranteed to be unique. We can quit as
+ * soon as we see an equal array, saving ourselves an extra
+ * comparison or two...
+ */
+ low_elem = mid_elem;
+ knownequal = true;
+ break;
+ }
+
+ if (result > 0)
+ low_elem = mid_elem + 1;
+ else
+ high_elem = mid_elem;
+ }
+
+ /*
+ * ... but our caller also cares about the position of the searched-for
+ * datum relative to the low_elem match we'll return. Make sure that we
+ * set *final_result to the result that comes from comparing low_elem's
+ * key value to the datum that caller had us search for.
+ */
+ if (!knownequal)
+ result = _bt_compare_array_skey(cur, orderproc, datum, null,
+ array->elem_values[low_elem]);
+
+ *final_result = result;
+
+ return low_elem;
+}
+
/*
* _bt_start_array_keys() -- Initialize array keys at start of a scan
*
@@ -539,82 +759,22 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
curArrayKey->cur_elem = 0;
skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
}
-
- so->arraysStarted = true;
-}
-
-/*
- * _bt_advance_array_keys() -- Advance to next set of array elements
- *
- * Returns true if there is another set of values to consider, false if not.
- * On true result, the scankeys are initialized with the next set of values.
- */
-bool
-_bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- bool found = false;
- int i;
-
- /*
- * We must advance the last array key most quickly, since it will
- * correspond to the lowest-order index column among the available
- * qualifications. This is necessary to ensure correct ordering of output
- * when there are multiple array keys.
- */
- for (i = so->numArrayKeys - 1; i >= 0; i--)
- {
- BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
- ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key];
- int cur_elem = curArrayKey->cur_elem;
- int num_elems = curArrayKey->num_elems;
-
- if (ScanDirectionIsBackward(dir))
- {
- if (--cur_elem < 0)
- {
- cur_elem = num_elems - 1;
- found = false; /* need to advance next array key */
- }
- else
- found = true;
- }
- else
- {
- if (++cur_elem >= num_elems)
- {
- cur_elem = 0;
- found = false; /* need to advance next array key */
- }
- else
- found = true;
- }
-
- curArrayKey->cur_elem = cur_elem;
- skey->sk_argument = curArrayKey->elem_values[cur_elem];
- if (found)
- break;
- }
-
- /* advance parallel scan */
- if (scan->parallel_scan != NULL)
- _bt_parallel_advance_array_keys(scan);
-
- /*
- * When no new array keys were found, the scan is "past the end" of the
- * array keys. _bt_start_array_keys can still "restart" the array keys if
- * a rescan is required.
- */
- if (!found)
- so->arraysStarted = false;
-
- return found;
}
/*
* _bt_mark_array_keys() -- Handle array keys during btmarkpos
*
* Save the current state of the array keys as the "mark" position.
+ *
+ * XXX The current set of array keys are not independent of the current scan
+ * position, so why treat them that way?
+ *
+ * We shouldn't even bother remembering the current array keys when btmarkpos
+ * is called. The array keys should be handled lazily instead. If and when
+ * btrestrpos is called, it can just set every array's cur_elem to the first
+ * element for the current scan direction. When _bt_advance_array_keys is
+ * reached (during the first call to _bt_checkkeys that follows), it will
+ * automatically search for the relevant array keys using caller's tuple.
*/
void
_bt_mark_array_keys(IndexScanDesc scan)
@@ -661,13 +821,8 @@ _bt_restore_array_keys(IndexScanDesc scan)
* If we changed any keys, we must redo _bt_preprocess_keys. That might
* sound like overkill, but in cases with multiple keys per index column
* it seems necessary to do the full set of pushups.
- *
- * Also do this whenever the scan's set of array keys "wrapped around" at
- * the end of the last primitive index scan. There won't have been a call
- * to _bt_preprocess_keys from some other place following wrap around, so
- * we do it for ourselves.
*/
- if (changed || !so->arraysStarted)
+ if (changed)
{
_bt_preprocess_keys(scan);
/* The mark should have been set on a consistent set of keys... */
@@ -675,6 +830,785 @@ _bt_restore_array_keys(IndexScanDesc scan)
}
}
+/*
+ * Routine to determine if a continuescan=false tuple (set that way by an
+ * initial call to _bt_check_compare) might need to advance the scan's array
+ * keys.
+ *
+ * Returns true when caller passes a tuple that is < the current set of array
+ * keys for the most significant non-equal column/scan key (or > for backwards
+ * scans). This means that it cannot possibly be time to advance the array
+ * keys just yet. _bt_checkkeys caller should suppress its _bt_check_compare
+ * call, and return -- the tuple is treated as not satisfy our indexquals.
+ *
+ * Returns false when caller's tuple is >= the current array keys (or <=, in
+ * the case of backwards scans). This means that it might be time for our
+ * caller to advance the array keys to the next set.
+ *
+ * Note: advancing the array keys may be required when every attribute value
+ * from caller's tuple is equal to corresponding scan key/array datums. See
+ * comments at the start of _bt_advance_array_keys for more.
+ */
+static bool
+_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ ScanDirection dir = pstate->dir;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ bool tuple_before_array_keys = false;
+ ScanKey cur;
+ int ntupatts = BTreeTupleGetNAtts(tuple, rel),
+ ikey;
+
+ Assert(so->qual_ok);
+ Assert(so->numArrayKeys > 0);
+ Assert(so->numberOfKeys > 0);
+ Assert(!so->needPrimScan);
+
+ for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++)
+ {
+ int attnum = cur->sk_attno;
+ FmgrInfo *orderproc;
+ Datum datum;
+ bool null,
+ skrequired;
+ int32 result;
+
+ /*
+ * We only deal with equality strategy scan keys. We leave handling
+ * of inequalities up to _bt_check_compare.
+ */
+ if (cur->sk_strategy != BTEqualStrategyNumber)
+ continue;
+
+ /*
+ * Determine if this scan key is required in the current scan
+ * direction
+ */
+ skrequired = ((ScanDirectionIsForward(dir) &&
+ (cur->sk_flags & SK_BT_REQFWD)) ||
+ (ScanDirectionIsBackward(dir) &&
+ (cur->sk_flags & SK_BT_REQBKWD)));
+
+ /*
+ * Unlike _bt_advance_array_keys, we never deal with any non-required
+ * array keys. Cases where skrequiredtrigger is set to false by
+ * _bt_check_compare should never call here. We are only called after
+ * _bt_check_compare provisionally indicated that the scan should be
+ * terminated due to a _required_ scan key not being satisfied.
+ *
+ * We expect _bt_check_compare to notice and report required scan keys
+ * before non-required ones. _bt_advance_array_keys might still have
+ * to advance non-required array keys in passing for a tuple that we
+ * were called for, but _bt_advance_array_keys doesn't rely on us to
+ * give it advanced notice of that.
+ */
+ if (!skrequired)
+ break;
+
+ if (attnum > ntupatts)
+ {
+ /*
+ * When we reach a high key's truncated attribute, assume that the
+ * tuple attribute's value is >= the scan's search-type scan keys
+ */
+ break;
+ }
+
+ datum = index_getattr(tuple, attnum, itupdesc, &null);
+
+ orderproc = &so->orderProcs[attnum - 1];
+ result = _bt_compare_array_skey(cur, orderproc,
+ datum, null,
+ cur->sk_argument);
+
+ if (result != 0)
+ {
+ if (ScanDirectionIsForward(dir))
+ tuple_before_array_keys = result < 0;
+ else
+ tuple_before_array_keys = result > 0;
+
+ break;
+ }
+ }
+
+ return tuple_before_array_keys;
+}
+
+/*
+ * _bt_array_keys_remain() -- Start another primitive index scan?
+ *
+ * Returns true if _bt_checkkeys determined that another primitive index scan
+ * must take place by calling _bt_first. Otherwise returns false, indicating
+ * that caller's top-level scan is now past the point where further matching
+ * index tuples can be found (for the current scan direction).
+ *
+ * Only call here during scans with one or more equality type array scan keys.
+ * All other scans should just call _bt_first once, no matter what.
+ *
+ * Top-level index scans executed via multiple primitive index scans must not
+ * fail to output index tuples in the usual order for the index -- just like
+ * any other index scan would. The state machine that manages the scan's
+ * array keys must only start primitive index scans when they cover key space
+ * strictly greater than the key space for tuples that the scan has already
+ * returned (or strictly less in the backwards scan case). Otherwise the scan
+ * could output the same index tuples more than once, or in the wrong order.
+ *
+ * This is managed by limiting the cases that can trigger new primitive index
+ * scans to those involving required array scan keys and/or other required
+ * scan keys that use the equality strategy. In particular, the state machine
+ * must not allow high order required scan keys using an inequality strategy
+ * (which are only required in one scan direction) to directly trigger a new
+ * primitive index scan that advances low order non-required array scan keys.
+ * For example, a query such as "SELECT thousand, tenthous FROM tenk1 WHERE
+ * thousand < 2 AND tenthous IN (1001,3000) ORDER BY thousand" whose execution
+ * involves a scan of an index on "(thousand, tenthous)" must perform no more
+ * than a single primitive index scan. Otherwise we risk outputting tuples in
+ * the wrong order. Array key values for the non-required scan key on the
+ * "tenthous" column must not dictate top-level scan order. Primitive index
+ * scans mustn't scan tuples already scanned by some earlier primitive scan.
+ *
+ * In fact, nbtree makes a stronger guarantee than is strictly necessary here:
+ * it guarantees that the top-level scan won't repeat any leaf page reads.
+ * (Actually, that can still happen when the scan is repositioned, or the scan
+ * direction changes -- but that's just as true with other types of scans.)
+ */
+bool
+_bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ Assert(so->numArrayKeys);
+
+ /*
+ * Array keys are advanced within _bt_checkkeys when the scan reaches the
+ * leaf level (more precisely, they're advanced when the scan reaches the
+ * end of each distinct set of array elements). This process avoids
+ * repeat access to leaf pages (across multiple primitive index scans) by
+ * opportunistically advancing the scan's array keys when it allows the
+ * primitive index scan to find nearby matching tuples (or to eliminate
+ * array keys with no matching tuples from further consideration).
+ *
+ * _bt_checkkeys sets a simple flag variable that we check here. This
+ * tells us if we need to perform another primitive index scan for the
+ * now-current array keys or not. We'll unset the flag once again to
+ * acknowledge having started a new primitive scan (or we'll see that it
+ * isn't set and end the top-level scan right away).
+ *
+ * We cannot rely on _bt_first always reaching _bt_checkkeys here. There
+ * are various scenarios where that won't happen. For example, if the
+ * index is completely empty, then _bt_first won't get as far as calling
+ * _bt_readpage/_bt_checkkeys.
+ *
+ * We also don't expect _bt_checkkeys to be reached when searching for a
+ * non-existent value that happens to be higher than any existing value in
+ * the index. No _bt_checkkeys are expected when _bt_readpage reads the
+ * rightmost page during such a scan -- even a _bt_checkkeys call against
+ * the high key won't happen. There is an analogous issue for backwards
+ * scans that search for a value lower than all existing index tuples.
+ *
+ * We don't actually require special handling for these cases -- we don't
+ * need to be explicitly instructed to _not_ perform another primitive
+ * index scan. This is correct for all of the cases we've listed so far,
+ * which all involve primitive index scans that access pages "near the
+ * boundaries of the key space" (the leftmost page, the rightmost page, or
+ * an imaginary empty leaf root page). If _bt_checkkeys cannot be reached
+ * by a primitive index scan for one set of array keys, it follows that it
+ * also won't be reached for any later set of array keys.
+ *
+ * There is one exception: the case where _bt_first's _bt_preprocess_keys
+ * call determined that the scan's input scan keys can never be satisfied.
+ * That might be true for one set of array keys, but not the next set.
+ */
+ if (!so->qual_ok)
+ {
+ /*
+ * Qual can never be satisfied. Advance our array keys incrementally.
+ */
+ so->needPrimScan = false;
+ if (_bt_advance_array_keys_increment(scan, dir))
+ return true;
+ }
+
+ /* Time for another primitive index scan? */
+ if (so->needPrimScan)
+ {
+ /* Begin primitive index scan */
+ so->needPrimScan = false;
+
+ if (scan->parallel_scan != NULL)
+ _bt_parallel_next_primitive_scan(scan);
+
+ return true;
+ }
+
+ /*
+ * No more primitive index scans. Just terminate the top-level scan.
+ */
+ _bt_advance_array_keys_to_end(scan, dir);
+
+ if (scan->parallel_scan != NULL)
+ _bt_parallel_done(scan);
+
+ return false;
+}
+
+/*
+ * _bt_advance_array_keys() -- Advance array elements using a tuple
+ *
+ * Returns true if all required equality-type scan keys (in particular, those
+ * that are array keys) now have exact matching values to those from tuple.
+ * Returns false when the tuple isn't an exact match in this sense.
+ *
+ * Sets pstate.continuescan for caller when we return false. When we return
+ * true it's up to caller to call _bt_check_compare to recheck the tuple. It
+ * is okay to let the second call set pstate.continuescan=false without
+ * further intervention, since we know that it can only be for a scan key that
+ * is required in one direction.
+ *
+ * When called with skrequiredtrigger, we don't expect to have to advance any
+ * non-required scan keys. We'll always set pstate.continuescan because a
+ * non-required scan key can never terminate the scan.
+ *
+ * Required array keys are always advanced to the highest element >= the
+ * corresponding tuple attribute values for its most significant non-equal
+ * column (or the next lowest set <= the tuple value during backwards scans).
+ * If we reach the end of the array keys for the current scan direction, we
+ * end the top-level index scan.
+ *
+ * _bt_tuple_before_array_skeys is responsible for determining if the current
+ * place in the scan is >= the current array keys (or <= during backward
+ * scans). This must be established first, before calling here.
+ *
+ * Note that we may sometimes need to advance the array keys in spite of the
+ * existing array keys already being an exact match for every corresponding
+ * value from caller's tuple. We fall back on "incrementally" advancing the
+ * array keys in these cases, which involve inequality strategy scan keys.
+ * For example, with a composite index on (a, b) and a qual "WHERE a IN (3,5)
+ * AND b < 42", we'll be called for both "a" arry keys (keys 3 and 5) when the
+ * scan reaches tuples where "b >= 42". Even though "a" array keys continue
+ * to have exact matches for tuples "b >= 42" (for both array key groupings),
+ * we will still advance the array for "a" via our fallback on incremental
+ * advancement each time we're called. The first time we're called (when the
+ * scan reaches a tuple >= "(3, 42)"), we advance the array key (from 3 to 5).
+ * This gives our caller the option of starting a new primitive index scan
+ * that quickly locates the start of tuples > "(5, -inf)". The second time
+ * we're called (when the scan reaches a tuple >= "(5, 42)"), we incrementally
+ * advance the keys a second time. This second call ends the top-level scan.
+ *
+ * Note also that we deal with all required equality-type scan keys here; it's
+ * not limited to array scan keys. We need to handle non-array equality cases
+ * here because they're equality constraints for the scan, in the same way
+ * that array scan keys are. We must not suppress cases where a call to
+ * _bt_check_compare sets continuescan=false for a required scan key that uses
+ * the equality strategy (only inequality-type scan keys get that treatment).
+ * We don't want to suppress the scan's termination when it's inappropriate.
+ */
+static bool
+_bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool skrequiredtrigger)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ ScanDirection dir = pstate->dir;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ ScanKey cur;
+ int ikey,
+ arrayidx = 0,
+ ntupatts = BTreeTupleGetNAtts(tuple, rel);
+ bool arrays_advanced = false,
+ arrays_done = false,
+ all_skrequired_atts_wrapped = skrequiredtrigger,
+ all_atts_equal = true;
+
+ Assert(so->numberOfKeys > 0);
+ Assert(so->numArrayKeys > 0);
+ Assert(so->qual_ok);
+
+ /*
+ * Try to advance array keys via a series of binary searches.
+ *
+ * Loop iterates through the current scankeys (so->keyData, which were
+ * output by _bt_preprocess_keys earlier) and then sets input scan keys
+ * (so->arrayKeyData scan keys) to new array values. This sets things up
+ * for our call to _bt_preprocess_keys, which is where the current scan
+ * keys actually change.
+ *
+ * We need to do things this way because only current/preprocessed scan
+ * keys will be marked as required. It's also possible that the previous
+ * call to _bt_preprocess_keys eliminated one or more input scan keys
+ * (possibly array type scan keys) that were deemed to be redundant.
+ */
+ for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++)
+ {
+ BTArrayKeyInfo *array = NULL;
+ ScanKey skeyarray = NULL;
+ FmgrInfo *orderproc;
+ int attnum = cur->sk_attno,
+ first_elem_dir,
+ final_elem_dir,
+ set_elem;
+ Datum datum;
+ bool skrequired,
+ null;
+ int32 result;
+
+ /*
+ * We only deal with equality strategy scan keys. We leave handling
+ * of inequalities up to _bt_check_compare.
+ */
+ if (cur->sk_strategy != BTEqualStrategyNumber)
+ continue;
+
+ /*
+ * Determine if this scan key is required in the current scan
+ * direction
+ */
+ skrequired = ((ScanDirectionIsForward(dir) &&
+ (cur->sk_flags & SK_BT_REQFWD)) ||
+ (ScanDirectionIsBackward(dir) &&
+ (cur->sk_flags & SK_BT_REQBKWD)));
+
+ /*
+ * Optimization: we don't have to advance remaining non-required array
+ * keys when we already know that tuple won't be returned by the scan.
+ *
+ * Deliberately check this both here and after the binary search.
+ */
+ if (!skrequired && !all_atts_equal)
+ break;
+
+ /*
+ * We need to check required non-array scan keys (that use the equal
+ * strategy), as well as required and non-required array scan keys
+ * (also limited to those that use the equal strategy, since array
+ * inequalities degenerate into a simple comparison).
+ *
+ * Perform initial set up for this scan key. If it is backed by an
+ * array then we need to set variables describing the current position
+ * in the array.
+ */
+ orderproc = &so->orderProcs[attnum - 1];
+ first_elem_dir = final_elem_dir = 0; /* keep compiler quiet */
+ if (cur->sk_flags & SK_SEARCHARRAY)
+ {
+ /* Set up array comparison function */
+ Assert(arrayidx < so->numArrayKeys);
+ array = &so->arrayKeys[arrayidx++];
+ skeyarray = &so->arrayKeyData[array->scan_key];
+
+ /*
+ * It's possible that _bt_preprocess_keys determined that an
+ * individual array scan key wasn't required in so->keyData for
+ * the ongoing primitive index scan due to it being redundant or
+ * contradictory (the current array value might be redundant next
+ * to some other scan key on the same attribute). Deal with that.
+ */
+ if (unlikely(skeyarray->sk_attno != attnum))
+ {
+ bool found PG_USED_FOR_ASSERTS_ONLY = false;
+
+ for (; arrayidx < so->numArrayKeys; arrayidx++)
+ {
+ array = &so->arrayKeys[arrayidx];
+ skeyarray = &so->arrayKeyData[array->scan_key];
+ if (skeyarray->sk_attno == attnum)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ Assert(found);
+ }
+
+ /* Proactively set up state used to handle array wraparound */
+ if (ScanDirectionIsForward(dir))
+ {
+ first_elem_dir = 0;
+ final_elem_dir = array->num_elems - 1;
+ }
+ else
+ {
+ first_elem_dir = array->num_elems - 1;
+ final_elem_dir = 0;
+ }
+ }
+ else if (attnum > ntupatts)
+ {
+ /*
+ * Nothing needs to be done when we have a truncated attribute
+ * (possible when caller's tuple is a page high key) and a
+ * non-array scan key
+ */
+ Assert(ScanDirectionIsForward(dir));
+ continue;
+ }
+
+ /*
+ * Here we perform steps for any required scan keys after the first
+ * non-equal required scan key. The first scan key must have been set
+ * to a value > the value from the tuple back when we dealt with it
+ * (or, for a backwards scan, to a value < the value from the tuple).
+ * That needs to "cascade" to lower-order array scan keys. They must
+ * be set to the first array element for the current scan direction.
+ *
+ * We're still setting the keys to values >= the tuple here -- it just
+ * needs to work for the tuple as a whole. For example, when a tuple
+ * "(a, b) = (42, 5)" advances the array keys on "a" from 40 to 45, we
+ * must also set "b" to whatever the first array element for "b" is.
+ * It would be wrong to allow "b" to be set to a value from the tuple,
+ * since the value is actually from a different part of the key space.
+ *
+ * Also defensively do this with truncated attributes when caller's
+ * tuple is a page high key.
+ */
+ if (array && ((arrays_advanced && !all_atts_equal) ||
+ attnum > ntupatts))
+ {
+ /* Shouldn't reach this far for a non-required scan key */
+ Assert(skrequired && skrequiredtrigger && attnum > 1);
+
+ /*
+ * We set the array to the first element (if needed) here, and we
+ * don't unset all_required_atts_wrapped. This array therefore
+ * counts as a wrapped array when we go on to determine if all of
+ * the required arrays have wrapped (after this loop).
+ */
+ if (array->cur_elem != first_elem_dir)
+ {
+ array->cur_elem = first_elem_dir;
+ skeyarray->sk_argument = array->elem_values[first_elem_dir];
+ arrays_advanced = true;
+ }
+
+ continue;
+ }
+
+ /*
+ * Going to compare scan key to corresponding tuple attribute value
+ */
+ datum = index_getattr(tuple, attnum, itupdesc, &null);
+
+ if (!array)
+ {
+ if (!skrequired || !all_atts_equal)
+ continue;
+
+ /*
+ * This is a required non-array scan key that uses the equal
+ * strategy. See header comments for an explanation of why we
+ * need to do this.
+ */
+ result = _bt_compare_array_skey(cur, orderproc, datum, null,
+ cur->sk_argument);
+
+ if (result != 0)
+ {
+ /*
+ * tuple attribute value is > scan key value (or < scan key
+ * value in the backward scan case).
+ */
+ all_atts_equal = false;
+ break;
+ }
+
+ continue;
+ }
+
+ /*
+ * Binary search for an array key >= the tuple value, which we'll then
+ * set as our current array key (or <= the tuple value if this is a
+ * backward scan).
+ *
+ * The binary search excludes array keys that we've already processed
+ * from consideration, except with a non-required scan key's array.
+ * This is not just an optimization -- it's important for correctness.
+ * It is crucial that required array scan keys only have their array
+ * keys advanced in the current scan direction. We need to advance
+ * required array keys in lock step with the index scan.
+ *
+ * Note in particular that arrays_advanced must only be set when the
+ * array is advanced to a key >= the existing key, or <= for a
+ * backwards scan. (Though see notes about wraparound below.)
+ */
+ set_elem = _bt_binsrch_array_skey(dir, (!skrequired || arrays_advanced),
+ array, cur, orderproc, datum, null,
+ &result);
+
+ /*
+ * Maintain the state that tracks whether all attribute from the tuple
+ * are equal to the array keys that we've set as current (or existing
+ * array keys set during earlier calls here).
+ */
+ if (result != 0)
+ all_atts_equal = false;
+
+ /*
+ * Optimization: we don't have to advance remaining non-required array
+ * keys when we already know that tuple won't be returned by the scan.
+ * Quit before setting the array keys to avoid _bt_preprocess_keys.
+ *
+ * Deliberately check this both before and after the binary search.
+ */
+ if (!skrequired && !all_atts_equal)
+ break;
+
+ /*
+ * If the binary search indicates that the key space for this tuple
+ * attribute value is > the key value from the final element in the
+ * array (final for the current scan direction), we handle it by
+ * wrapping around to the first element of the array.
+ *
+ * Wrapping around simplifies advancement with a multi-column index by
+ * allowing us to treat wrapping a column as advancing the column. We
+ * preserve the invariant that a required scan key's array may only be
+ * ratcheted forward (backwards when the scan direction is backwards),
+ * while still always being able to "advance" the array at this point.
+ */
+ if (set_elem == final_elem_dir &&
+ ((ScanDirectionIsForward(dir) && result > 0) ||
+ (ScanDirectionIsBackward(dir) && result < 0)))
+ {
+ /* Perform wraparound */
+ set_elem = first_elem_dir;
+ }
+ else if (skrequired)
+ {
+ /* Won't call _bt_advance_array_keys_to_end later */
+ all_skrequired_atts_wrapped = false;
+ }
+
+ Assert(set_elem >= 0 && set_elem < array->num_elems);
+ if (array->cur_elem != set_elem)
+ {
+ array->cur_elem = set_elem;
+ skeyarray->sk_argument = array->elem_values[set_elem];
+ arrays_advanced = true;
+
+ /*
+ * We shouldn't have to advance a required array when called due
+ * to _bt_check_compare determining that a non-required array
+ * needs to be advanced. We expect _bt_check_compare to notice
+ * and report required scan keys before non-required ones.
+ */
+ Assert(skrequiredtrigger || !skrequired);
+ }
+ }
+
+ if (!skrequiredtrigger)
+ {
+ /*
+ * Failing to satisfy a non-required array scan key shouldn't ever
+ * result in terminating the (primitive) index scan
+ */
+ }
+ else if (all_skrequired_atts_wrapped)
+ {
+ /*
+ * The binary searches for each tuple's attribute value in the scan
+ * key's corresponding SK_SEARCHARRAY array all found that the tuple's
+ * value are "past the end" of the key space covered by each array
+ */
+ _bt_advance_array_keys_to_end(scan, dir);
+ arrays_done = true;
+ all_atts_equal = false; /* at least not now */
+ }
+ else if (!arrays_advanced)
+ {
+ /*
+ * We must always advance the array keys by at least one increment
+ * (except when called to advance a non-required scan key's array).
+ *
+ * We need this fallback for cases where the existing array keys and
+ * existing required equal-strategy scan keys were fully equal to the
+ * tuple. _bt_check_compare may have set continuescan=false due to an
+ * inequality terminating the scan, which we don't deal with directly.
+ * (See function's header comments for an example.)
+ */
+ if (_bt_advance_array_keys_increment(scan, dir))
+ arrays_advanced = true;
+ else
+ arrays_done = true;
+ all_atts_equal = false; /* at least not now */
+ }
+
+ /*
+ * Might make sense to recheck the high key later on in cases where we
+ * just advanced the keys (unless we were just called to advance the
+ * scan's non-required array keys)
+ */
+ if (arrays_advanced && skrequiredtrigger)
+ pstate->highkeychecked = false;
+
+ /*
+ * If we changed the array keys without exhausting all array keys then we
+ * need to preprocess our search-type scan keys once more
+ */
+ Assert(skrequiredtrigger || !arrays_done);
+ if (arrays_advanced && !arrays_done)
+ {
+ /*
+ * XXX Think about buffer-lock-held hazards here some more.
+ *
+ * In almost all interesting cases we only really need to copy over
+ * the array values (from "so->arrayKeyData" to "so->keyData"). But
+ * there are at least some cases where performing the full set of push
+ * ups here (or close to it) might add value over just doing it for
+ * the main _bt_first call.
+ */
+ _bt_preprocess_keys(scan);
+ }
+
+ /* Are we now done with the top-level scan (barring a btrescan)? */
+ Assert(!so->needPrimScan);
+ if (!so->qual_ok)
+ {
+ /*
+ * Increment array keys and start a new primitive index scan if
+ * _bt_preprocess_keys() discovered that the scan keys can never be
+ * satisfied (eg, x == 2 AND x in (1, 2, 3) for array keys 1 and 2).
+ *
+ * Note: There is similar handling in _bt_array_keys_remain, which
+ * must advance the array keys without consulting us in this one case.
+ */
+ Assert(skrequiredtrigger);
+
+ pstate->continuescan = false;
+ pstate->highkeychecked = true;
+ all_atts_equal = false; /* at least not now */
+
+ if (_bt_advance_array_keys_increment(scan, dir))
+ so->needPrimScan = true;
+ }
+ else if (!skrequiredtrigger)
+ {
+ /* Not when we failed to satisfy a non-required scan key, ever */
+ Assert(!arrays_done);
+ pstate->continuescan = true;
+ }
+ else if (arrays_done)
+ {
+ /*
+ * Yep -- this primitive scan was our last
+ */
+ Assert(!all_atts_equal);
+ pstate->continuescan = false;
+ }
+ else if (!all_atts_equal)
+ {
+ /*
+ * Not done. The top-level index scan (and primitive index scan) will
+ * continue, since the array keys advanced.
+ */
+ Assert(arrays_advanced);
+ pstate->continuescan = true;
+
+ /*
+ * Some required array keys might have wrapped around during this
+ * call, but it can't have been the most significant array scan key.
+ */
+ Assert(!all_skrequired_atts_wrapped);
+ }
+ else
+ {
+ /*
+ * Not done. A second call to _bt_check_compare must now take place.
+ * It will make the final decision on setting continuescan.
+ */
+ }
+
+ return all_atts_equal;
+}
+
+/*
+ * Advance the array keys by a single increment in the current scan direction
+ */
+static bool
+_bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool found = false;
+ int i;
+
+ Assert(!so->needPrimScan);
+
+ /*
+ * We must advance the last array key most quickly, since it will
+ * correspond to the lowest-order index column among the available
+ * qualifications. This is necessary to ensure correct ordering of output
+ * when there are multiple array keys.
+ */
+ for (i = so->numArrayKeys - 1; i >= 0; i--)
+ {
+ BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
+ ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key];
+ int cur_elem = curArrayKey->cur_elem;
+ int num_elems = curArrayKey->num_elems;
+
+ if (ScanDirectionIsBackward(dir))
+ {
+ if (--cur_elem < 0)
+ {
+ cur_elem = num_elems - 1;
+ found = false; /* need to advance next array key */
+ }
+ else
+ found = true;
+ }
+ else
+ {
+ if (++cur_elem >= num_elems)
+ {
+ cur_elem = 0;
+ found = false; /* need to advance next array key */
+ }
+ else
+ found = true;
+ }
+
+ curArrayKey->cur_elem = cur_elem;
+ skey->sk_argument = curArrayKey->elem_values[cur_elem];
+ if (found)
+ break;
+ }
+
+ return found;
+}
+
+/*
+ * Perform final steps when the "end point" is reached on the leaf level
+ * without any call to _bt_checkkeys setting *continuescan to false.
+ */
+static void
+_bt_advance_array_keys_to_end(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ Assert(so->numArrayKeys);
+ Assert(!so->needPrimScan);
+
+ for (int i = 0; i < so->numArrayKeys; i++)
+ {
+ BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
+ ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key];
+ int reset_elem;
+
+ if (ScanDirectionIsForward(dir))
+ reset_elem = curArrayKey->num_elems - 1;
+ else
+ reset_elem = 0;
+
+ if (curArrayKey->cur_elem != reset_elem)
+ {
+ curArrayKey->cur_elem = reset_elem;
+ skey->sk_argument = curArrayKey->elem_values[reset_elem];
+ }
+ }
+}
/*
* _bt_preprocess_keys() -- Preprocess scan keys
@@ -1360,41 +2294,210 @@ _bt_mark_scankey_required(ScanKey skey)
*
* Return true if so, false if not. If the tuple fails to pass the qual,
* we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
+ * this tuple, and set pstate.continuescan accordingly. See comments for
* _bt_preprocess_keys(), above, about how this is done.
*
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
+ * Forward scan callers can pass a high key tuple in the hopes of having us
+ * set pstate.continuescan to false, and avoiding an unnecessary visit to the
+ * page to the right.
+ *
+ * Forwards scan callers with equality type array scan keys are obligated to
+ * set up page state in a way that makes it possible for us to check the high
+ * key early, before we've expended too much effort on comparing tuples that
+ * cannot possibly be matches for any set of array keys. This is just an
+ * optimization.
+ *
+ * Advances the current set of array keys for SK_SEARCHARRAY scans where
+ * appropriate. These callers are required to initialize the page level high
+ * key in pstate before the first call here for the page (when the scan
+ * direction is forwards). Note that we rely on _bt_readpage calling here in
+ * page offset number order (for its scan direction). Any other order will
+ * lead to inconsistent array key state.
*
* scan: index scan descriptor (containing a search-type scankey)
+ * pstate: Page level input and output parameters
* tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
+ * finaltup: Is tuple the final one we'll be called with for this page?
* requiredMatchedByPrecheck: indicates that scan keys required for
* direction scan are already matched
*/
bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan,
+_bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool finaltup,
bool requiredMatchedByPrecheck)
{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
+ TupleDesc tupdesc = RelationGetDescr(scan->indexRelation);
+ int natts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool res;
+ bool skrequiredtrigger;
+
+ Assert(so->qual_ok);
+ Assert(pstate->continuescan);
+ Assert(!so->needPrimScan);
+
+ res = _bt_check_compare(pstate->dir, so->keyData, so->numberOfKeys,
+ tuple, natts, tupdesc,
+ &pstate->continuescan, &skrequiredtrigger,
+ requiredMatchedByPrecheck);
+
+ /*
+ * Only one _bt_check_compare call is required in the common case where
+ * there are no equality-type array scan keys.
+ *
+ * When there are array scan keys then we can still accept the first
+ * answer we get from _bt_check_compare when continuescan wasn't unset.
+ */
+ if (!so->numArrayKeys || pstate->continuescan)
+ return res;
+
+ /*
+ * _bt_check_compare set continuescan=false in the presence of equality
+ * type array keys. It's possible that we haven't reached the start of
+ * the array keys just yet. It's also possible that we need to advance
+ * the array keys now. (Or perhaps we really do need to terminate the
+ * top-level scan.)
+ */
+ pstate->continuescan = true; /* new initial assumption */
+
+ if (skrequiredtrigger && _bt_tuple_before_array_skeys(scan, pstate, tuple))
+ {
+ /*
+ * Tuple is still < the current array scan key values (as well as
+ * other equality type scan keys) if this is a forward scan.
+ * (Backwards scans reach here with a tuple > equality constraints.)
+ * We must now consider how to proceed with the ongoing primitive
+ * index scan.
+ *
+ * Should _bt_readpage continue with this page for now, in the hope of
+ * finding tuples whose key space is covered by the current array keys
+ * before too long? Or, should it give up and start a new primitive
+ * index scan instead?
+ *
+ * Our policy is to terminate the primitive index scan at the end of
+ * the current page if the current (most recently advanced) array keys
+ * don't cover the final tuple from the page. This policy is fairly
+ * conservative.
+ *
+ * Note: In some cases we're effectively speculating that the next
+ * sibling leaf page will have tuples that are covered by the key
+ * space of our array keys (the current set or some nearby set), based
+ * on a cue from the current page's final tuple. There is at least a
+ * non-zero risk of wasting a page access -- we could gamble and lose.
+ * The details of all this are handled within _bt_advance_array_keys.
+ */
+ if (finaltup || (!pstate->highkeychecked && pstate->highkey &&
+ _bt_tuple_before_array_skeys(scan, pstate,
+ pstate->highkey)))
+ {
+ /*
+ * This is the final tuple (the high key for forward scans, or the
+ * tuple at the first offset number for backward scans), but it is
+ * still before the current array keys. As such, we're unwilling
+ * to allow the current primitive index scan to continue to the
+ * next leaf page.
+ *
+ * Start a new primitive index scan. The next primitive index
+ * scan (in the next _bt_first call) is expected to reposition the
+ * scan to some much later leaf page. (If we had a good reason to
+ * think that the next leaf page that will be scanned will turn
+ * out to be close to our current position, then we wouldn't be
+ * starting another primitive index scan.)
+ *
+ * Note: _bt_readpage stashes the page high key, which allows us
+ * to make this check early (for forward scans). We thereby avoid
+ * scanning very many extra tuples on the page. This is just an
+ * optimization; skipping these useless comparisons should never
+ * change our final conclusion about what the scan should do next.
+ */
+ pstate->continuescan = false;
+ so->needPrimScan = true;
+ }
+ else if (!finaltup && pstate->highkey)
+ {
+ /*
+ * Remember that the high key has been checked with this
+ * particular set of array keys.
+ *
+ * It might make sense to check the same high key again at some
+ * point during the ongoing _bt_readpage-wise scan of this page.
+ * But it is definitely wasteful to repeat the same high key check
+ * before the array keys are advanced by some later tuple.
+ */
+ pstate->highkeychecked = true;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual
+ */
+ return false;
+ }
+
+ /*
+ * Caller's tuple is >= the current set of array keys and other equality
+ * constraint scan keys (or <= if this is a backwards scans).
+ *
+ * It might be time to advance the array keys to the next set. Try doing
+ * that now, while determining in passing if the tuple matches the newly
+ * advanced set of array keys (if we've any left).
+ *
+ * This call will also set continuescan for us (or tells us to perform
+ * another _bt_check_compare call, which then sets continuescan for us).
+ */
+ if (!_bt_advance_array_keys(scan, pstate, tuple, skrequiredtrigger))
+ {
+ /*
+ * Tuple doesn't match any later array keys, either (for one or more
+ * array type scan keys marked as required). Give up on this tuple
+ * being a match. (Call may have also terminated the primitive scan,
+ * or the top-level scan.)
+ */
+ return false;
+ }
+
+ /*
+ * Advanced array keys to values that are exact matches for corresponding
+ * attribute values from the tuple.
+ *
+ * It's fairly likely that the tuple satisfies all index scan conditions
+ * at this point, but we need confirmation of that. We also need to give
+ * _bt_check_compare a real opportunity to end the top-level index scan by
+ * setting continuescan=false. (_bt_advance_array_keys cannot deal with
+ * inequality strategy scan keys; we need _bt_check_compare for those.)
+ */
+ return _bt_check_compare(pstate->dir, so->keyData, so->numberOfKeys,
+ tuple, natts, tupdesc,
+ &pstate->continuescan, &skrequiredtrigger,
+ requiredMatchedByPrecheck);
+}
+
+/*
+ * Test whether an indextuple satisfies current scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction to
+ * pass the qual with the current set of array keys.
+ *
+ * This is a subroutine for _bt_checkkeys. It is written with the assumption
+ * that reaching the end of each distinct set of array keys terminates the
+ * ongoing primitive index scan. It is up to our caller (that has more
+ * context than we have available here) to override that initial determination
+ * when it makes more sense to advance the array keys and continue with
+ * further tuples from the same leaf page.
+ */
+static bool
+_bt_check_compare(ScanDirection dir, ScanKey keyData, int keysz,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ bool *continuescan, bool *skrequiredtrigger,
+ bool requiredMatchedByPrecheck)
+{
int ikey;
ScanKey key;
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
*continuescan = true; /* default assumption */
+ *skrequiredtrigger = true; /* default assumption */
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ for (key = keyData, ikey = 0; ikey < keysz; key++, ikey++)
{
Datum datum;
bool isNull;
@@ -1525,18 +2628,11 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* opposite direction scan, it must be already satisfied by
* _bt_first() except for the NULLs checking, which have already done
* above.
+ *
+ * FIXME
*/
- if (!requiredOppositeDir)
- {
- test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
- datum, key->sk_argument);
- }
- else
- {
- test = true;
- Assert(test == FunctionCall2Coll(&key->sk_func, key->sk_collation,
- datum, key->sk_argument));
- }
+ test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
+ datum, key->sk_argument);
if (!DatumGetBool(test))
{
@@ -1549,10 +2645,22 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* qual fails, it is critical that equality quals be used for the
* initial positioning in _bt_first() when they are available. See
* comments in _bt_first().
+ *
+ * Scans with equality-type array scan keys run into a similar
+ * problem whenever they advance the array keys. Our caller uses
+ * _bt_tuple_before_array_skeys to avoid the problem there.
*/
if (requiredSameDir)
*continuescan = false;
+ if ((key->sk_flags & SK_SEARCHARRAY) &&
+ key->sk_strategy == BTEqualStrategyNumber)
+ {
+ if (*continuescan)
+ *skrequiredtrigger = false;
+ *continuescan = false;
+ }
+
/*
* In any case, this indextuple doesn't match the qual.
*/
@@ -1571,7 +2679,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* it's not possible for any future tuples in the current scan direction
* to pass the qual.
*
- * This is a subroutine for _bt_checkkeys, which see for more info.
+ * This is a subroutine for _bt_check_compare/_bt_checkkeys_compare.
*/
static bool
_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 6a93d767a..f04ca1ee9 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -106,8 +106,7 @@ static List *build_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexOptInfo *index, IndexClauseSet *clauses,
bool useful_predicate,
ScanTypeControl scantype,
- bool *skip_nonnative_saop,
- bool *skip_lower_saop);
+ bool *skip_nonnative_saop);
static List *build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel,
List *clauses, List *other_clauses);
static List *generate_bitmap_or_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -706,8 +705,6 @@ eclass_already_used(EquivalenceClass *parent_ec, Relids oldrelids,
* index AM supports them natively, we should just include them in simple
* index paths. If not, we should exclude them while building simple index
* paths, and then make a separate attempt to include them in bitmap paths.
- * Furthermore, we should consider excluding lower-order ScalarArrayOpExpr
- * quals so as to create ordered paths.
*/
static void
get_index_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -716,37 +713,17 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
{
List *indexpaths;
bool skip_nonnative_saop = false;
- bool skip_lower_saop = false;
ListCell *lc;
/*
* Build simple index paths using the clauses. Allow ScalarArrayOpExpr
- * clauses only if the index AM supports them natively, and skip any such
- * clauses for index columns after the first (so that we produce ordered
- * paths if possible).
+ * clauses only if the index AM supports them natively.
*/
indexpaths = build_index_paths(root, rel,
index, clauses,
index->predOK,
ST_ANYSCAN,
- &skip_nonnative_saop,
- &skip_lower_saop);
-
- /*
- * If we skipped any lower-order ScalarArrayOpExprs on an index with an AM
- * that supports them, then try again including those clauses. This will
- * produce paths with more selectivity but no ordering.
- */
- if (skip_lower_saop)
- {
- indexpaths = list_concat(indexpaths,
- build_index_paths(root, rel,
- index, clauses,
- index->predOK,
- ST_ANYSCAN,
- &skip_nonnative_saop,
- NULL));
- }
+ &skip_nonnative_saop);
/*
* Submit all the ones that can form plain IndexScan plans to add_path. (A
@@ -784,7 +761,6 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
index, clauses,
false,
ST_BITMAPSCAN,
- NULL,
NULL);
*bitindexpaths = list_concat(*bitindexpaths, indexpaths);
}
@@ -817,27 +793,19 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
* to true if we found any such clauses (caller must initialize the variable
* to false). If it's NULL, we do not ignore ScalarArrayOpExpr clauses.
*
- * If skip_lower_saop is non-NULL, we ignore ScalarArrayOpExpr clauses for
- * non-first index columns, and we set *skip_lower_saop to true if we found
- * any such clauses (caller must initialize the variable to false). If it's
- * NULL, we do not ignore non-first ScalarArrayOpExpr clauses, but they will
- * result in considering the scan's output to be unordered.
- *
* 'rel' is the index's heap relation
* 'index' is the index for which we want to generate paths
* 'clauses' is the collection of indexable clauses (IndexClause nodes)
* 'useful_predicate' indicates whether the index has a useful predicate
* 'scantype' indicates whether we need plain or bitmap scan support
* 'skip_nonnative_saop' indicates whether to accept SAOP if index AM doesn't
- * 'skip_lower_saop' indicates whether to accept non-first-column SAOP
*/
static List *
build_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexOptInfo *index, IndexClauseSet *clauses,
bool useful_predicate,
ScanTypeControl scantype,
- bool *skip_nonnative_saop,
- bool *skip_lower_saop)
+ bool *skip_nonnative_saop)
{
List *result = NIL;
IndexPath *ipath;
@@ -848,7 +816,6 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
List *orderbyclausecols;
List *index_pathkeys;
List *useful_pathkeys;
- bool found_lower_saop_clause;
bool pathkeys_possibly_useful;
bool index_is_ordered;
bool index_only_scan;
@@ -880,19 +847,11 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
* on by btree and possibly other places.) The list can be empty, if the
* index AM allows that.
*
- * found_lower_saop_clause is set true if we accept a ScalarArrayOpExpr
- * index clause for a non-first index column. This prevents us from
- * assuming that the scan result is ordered. (Actually, the result is
- * still ordered if there are equality constraints for all earlier
- * columns, but it seems too expensive and non-modular for this code to be
- * aware of that refinement.)
- *
* We also build a Relids set showing which outer rels are required by the
* selected clauses. Any lateral_relids are included in that, but not
* otherwise accounted for.
*/
index_clauses = NIL;
- found_lower_saop_clause = false;
outer_relids = bms_copy(rel->lateral_relids);
for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++)
{
@@ -917,16 +876,6 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
/* Caller had better intend this only for bitmap scan */
Assert(scantype == ST_BITMAPSCAN);
}
- if (indexcol > 0)
- {
- if (skip_lower_saop)
- {
- /* Caller doesn't want to lose index ordering */
- *skip_lower_saop = true;
- continue;
- }
- found_lower_saop_clause = true;
- }
}
/* OK to include this clause */
@@ -956,11 +905,9 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
/*
* 2. Compute pathkeys describing index's ordering, if any, then see how
* many of them are actually useful for this query. This is not relevant
- * if we are only trying to build bitmap indexscans, nor if we have to
- * assume the scan is unordered.
+ * if we are only trying to build bitmap indexscans.
*/
pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN &&
- !found_lower_saop_clause &&
has_useful_pathkeys(root, rel));
index_is_ordered = (index->sortopfamily != NULL);
if (index_is_ordered && pathkeys_possibly_useful)
@@ -1212,7 +1159,6 @@ build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel,
index, &clauseset,
useful_predicate,
ST_BITMAPSCAN,
- NULL,
NULL);
result = list_concat(result, indexpaths);
}
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076..c796b53a6 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6444,8 +6444,6 @@ genericcostestimate(PlannerInfo *root,
double numIndexTuples;
double spc_random_page_cost;
double num_sa_scans;
- double num_outer_scans;
- double num_scans;
double qual_op_cost;
double qual_arg_cost;
List *selectivityQuals;
@@ -6460,7 +6458,7 @@ genericcostestimate(PlannerInfo *root,
/*
* Check for ScalarArrayOpExpr index quals, and estimate the number of
- * index scans that will be performed.
+ * primitive index scans that will be performed for caller
*/
num_sa_scans = 1;
foreach(l, indexQuals)
@@ -6490,19 +6488,8 @@ genericcostestimate(PlannerInfo *root,
*/
numIndexTuples = costs->numIndexTuples;
if (numIndexTuples <= 0.0)
- {
numIndexTuples = indexSelectivity * index->rel->tuples;
- /*
- * The above calculation counts all the tuples visited across all
- * scans induced by ScalarArrayOpExpr nodes. We want to consider the
- * average per-indexscan number, so adjust. This is a handy place to
- * round to integer, too. (If caller supplied tuple estimate, it's
- * responsible for handling these considerations.)
- */
- numIndexTuples = rint(numIndexTuples / num_sa_scans);
- }
-
/*
* We can bound the number of tuples by the index size in any case. Also,
* always estimate at least one tuple is touched, even when
@@ -6540,27 +6527,31 @@ genericcostestimate(PlannerInfo *root,
*
* The above calculations are all per-index-scan. However, if we are in a
* nestloop inner scan, we can expect the scan to be repeated (with
- * different search keys) for each row of the outer relation. Likewise,
- * ScalarArrayOpExpr quals result in multiple index scans. This creates
- * the potential for cache effects to reduce the number of disk page
- * fetches needed. We want to estimate the average per-scan I/O cost in
- * the presence of caching.
+ * different search keys) for each row of the outer relation. This
+ * creates the potential for cache effects to reduce the number of disk
+ * page fetches needed. We want to estimate the average per-scan I/O cost
+ * in the presence of caching.
*
* We use the Mackert-Lohman formula (see costsize.c for details) to
* estimate the total number of page fetches that occur. While this
* wasn't what it was designed for, it seems a reasonable model anyway.
* Note that we are counting pages not tuples anymore, so we take N = T =
* index size, as if there were one "tuple" per page.
+ *
+ * Note: we assume that there will be no repeat index page fetches across
+ * ScalarArrayOpExpr primitive scans from the same logical index scan.
+ * This is guaranteed to be true for btree indexes, but is very optimistic
+ * with index AMs that cannot natively execute ScalarArrayOpExpr quals.
+ * However, these same index AMs also accept our default pessimistic
+ * approach to counting num_sa_scans (btree caller caps this), so we don't
+ * expect the final indexTotalCost to be wildly over-optimistic.
*/
- num_outer_scans = loop_count;
- num_scans = num_sa_scans * num_outer_scans;
-
- if (num_scans > 1)
+ if (loop_count > 1)
{
double pages_fetched;
/* total page fetches ignoring cache effects */
- pages_fetched = numIndexPages * num_scans;
+ pages_fetched = numIndexPages * loop_count;
/* use Mackert and Lohman formula to adjust for cache effects */
pages_fetched = index_pages_fetched(pages_fetched,
@@ -6570,11 +6561,9 @@ genericcostestimate(PlannerInfo *root,
/*
* Now compute the total disk access cost, and then report a pro-rated
- * share for each outer scan. (Don't pro-rate for ScalarArrayOpExpr,
- * since that's internal to the indexscan.)
+ * share for each outer scan
*/
- indexTotalCost = (pages_fetched * spc_random_page_cost)
- / num_outer_scans;
+ indexTotalCost = (pages_fetched * spc_random_page_cost) / loop_count;
}
else
{
@@ -6590,10 +6579,8 @@ genericcostestimate(PlannerInfo *root,
* evaluated once at the start of the scan to reduce them to runtime keys
* to pass to the index AM (see nodeIndexscan.c). We model the per-tuple
* CPU costs as cpu_index_tuple_cost plus one cpu_operator_cost per
- * indexqual operator. Because we have numIndexTuples as a per-scan
- * number, we have to multiply by num_sa_scans to get the correct result
- * for ScalarArrayOpExpr cases. Similarly add in costs for any index
- * ORDER BY expressions.
+ * indexqual operator. Similarly add in costs for any index ORDER BY
+ * expressions.
*
* Note: this neglects the possible costs of rechecking lossy operators.
* Detecting that that might be needed seems more expensive than it's
@@ -6606,7 +6593,7 @@ genericcostestimate(PlannerInfo *root,
indexStartupCost = qual_arg_cost;
indexTotalCost += qual_arg_cost;
- indexTotalCost += numIndexTuples * num_sa_scans * (cpu_index_tuple_cost + qual_op_cost);
+ indexTotalCost += numIndexTuples * (cpu_index_tuple_cost + qual_op_cost);
/*
* Generic assumption about index correlation: there isn't any.
@@ -6684,7 +6671,6 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
bool eqQualHere;
bool found_saop;
bool found_is_null_op;
- double num_sa_scans;
ListCell *lc;
/*
@@ -6699,17 +6685,12 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
*
* For a RowCompareExpr, we consider only the first column, just as
* rowcomparesel() does.
- *
- * If there's a ScalarArrayOpExpr in the quals, we'll actually perform N
- * index scans not one, but the ScalarArrayOpExpr's operator can be
- * considered to act the same as it normally does.
*/
indexBoundQuals = NIL;
indexcol = 0;
eqQualHere = false;
found_saop = false;
found_is_null_op = false;
- num_sa_scans = 1;
foreach(lc, path->indexclauses)
{
IndexClause *iclause = lfirst_node(IndexClause, lc);
@@ -6749,14 +6730,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
else if (IsA(clause, ScalarArrayOpExpr))
{
ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause;
- Node *other_operand = (Node *) lsecond(saop->args);
- int alength = estimate_array_length(other_operand);
clause_op = saop->opno;
found_saop = true;
- /* count number of SA scans induced by indexBoundQuals only */
- if (alength > 1)
- num_sa_scans *= alength;
}
else if (IsA(clause, NullTest))
{
@@ -6805,9 +6781,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
Selectivity btreeSelectivity;
/*
- * If the index is partial, AND the index predicate with the
- * index-bound quals to produce a more accurate idea of the number of
- * rows covered by the bound conditions.
+ * AND the index predicate with the index-bound quals to produce a
+ * more accurate idea of the number of rows covered by the bound
+ * conditions
*/
selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
@@ -6816,13 +6792,6 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
JOIN_INNER,
NULL);
numIndexTuples = btreeSelectivity * index->rel->tuples;
-
- /*
- * As in genericcostestimate(), we have to adjust for any
- * ScalarArrayOpExpr quals included in indexBoundQuals, and then round
- * to integer.
- */
- numIndexTuples = rint(numIndexTuples / num_sa_scans);
}
/*
@@ -6832,6 +6801,43 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
genericcostestimate(root, path, loop_count, &costs);
+ /*
+ * Now compensate for btree's ability to efficiently execute scans with
+ * SAOP clauses.
+ *
+ * btree automatically combines individual ScalarArrayOpExpr primitive
+ * index scans whenever the tuples covered by the next set of array keys
+ * are close to tuples covered by the current set. This makes the final
+ * number of descents particularly difficult to estimate. However, btree
+ * scans never visit any single leaf page more than once. That puts a
+ * natural floor under the worst case number of descents.
+ *
+ * It's particularly important that we not wildly overestimate the number
+ * of descents needed for a clause list with several SAOPs -- the costs
+ * really aren't multiplicative in the way genericcostestimate expects. In
+ * general, most distinct combinations of SAOP keys will tend to not find
+ * any matching tuples. Furthermore, btree scans search for the next set
+ * of array keys using the next tuple in line, and so won't even need a
+ * direct comparison to eliminate most non-matching sets of array keys.
+ *
+ * Clamp the number of descents to the estimated number of leaf page
+ * visits. This is still fairly pessimistic, but tends to result in more
+ * accurate costing of scans with several SAOP clauses -- especially when
+ * each array has more than a few elements. The cost of adding additional
+ * array constants to a low-order SAOP column should saturate past a
+ * certain point (except where selectivity estimates continue to shift).
+ *
+ * Also clamp the number of descents to 1/3 the number of index pages.
+ * This avoids implausibly high estimates with low selectivity paths,
+ * where scans frequently require no more than one or two descents.
+ */
+ if (costs.num_sa_scans > 1)
+ {
+ costs.num_sa_scans = Min(costs.num_sa_scans, costs.numIndexPages);
+ costs.num_sa_scans = Min(costs.num_sa_scans, index->pages / 3);
+ costs.num_sa_scans = Max(costs.num_sa_scans, 1);
+ }
+
/*
* Add a CPU-cost component to represent the costs of initial btree
* descent. We don't charge any I/O cost for touching upper btree levels,
@@ -6839,9 +6845,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* comparisons to descend a btree of N leaf tuples. We charge one
* cpu_operator_cost per comparison.
*
- * If there are ScalarArrayOpExprs, charge this once per SA scan. The
- * ones after the first one are not startup cost so far as the overall
- * plan is concerned, so add them only to "total" cost.
+ * If there are ScalarArrayOpExprs, charge this once per estimated
+ * primitive SA scan. The ones after the first one are not startup cost
+ * so far as the overall plan goes, so just add them to "total" cost.
*/
if (index->tuples > 1) /* avoid computing log(0) */
{
@@ -6858,7 +6864,8 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* in cases where only a single leaf page is expected to be visited. This
* cost is somewhat arbitrarily set at 50x cpu_operator_cost per page
* touched. The number of such pages is btree tree height plus one (ie,
- * we charge for the leaf page too). As above, charge once per SA scan.
+ * we charge for the leaf page too). As above, charge once per estimated
+ * primitive SA scan.
*/
descentCost = (index->tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER * cpu_operator_cost;
costs.indexStartupCost += descentCost;
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1149093a8..6a5068c72 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4005,6 +4005,19 @@ description | Waiting for a newly initialized WAL file to reach durable storage
</para>
</note>
+ <note>
+ <para>
+ Every time an index is searched, the index's
+ <structname>pg_stat_all_indexes</structname>.<structfield>idx_scan</structfield>
+ field is incremented. This usually happens once per index scan node
+ execution, but might take place several times during execution of a scan
+ that searches for multiple values together. Only queries that use certain
+ <acronym>SQL</acronym> constructs to search for rows matching any value
+ out of a list (or an array) of multiple scalar values are affected. See
+ <xref linkend="functions-comparisons"/> for details.
+ </para>
+ </note>
+
</sect2>
<sect2 id="monitoring-pg-statio-all-tables-view">
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index acfd9d1f4..84c068ae3 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1910,7 +1910,7 @@ SELECT count(*) FROM dupindexcols
(1 row)
--
--- Check ordering of =ANY indexqual results (bug in 9.2.0)
+-- Check that index scans with =ANY indexquals return rows in index order
--
explain (costs off)
SELECT unique1 FROM tenk1
@@ -1936,12 +1936,11 @@ explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
- QUERY PLAN
--------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------------------
Index Only Scan using tenk1_thous_tenthous on tenk1
- Index Cond: (thousand < 2)
- Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
-(3 rows)
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -1952,18 +1951,35 @@ ORDER BY thousand;
1 | 1001
(2 rows)
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Only Scan Backward using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ thousand | tenthous
+----------+----------
+ 1 | 1001
+ 0 | 3000
+(2 rows)
+
SET enable_indexonlyscan = OFF;
explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
- QUERY PLAN
---------------------------------------------------------------------------------------
- Sort
- Sort Key: thousand
- -> Index Scan using tenk1_thous_tenthous on tenk1
- Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
-(4 rows)
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -1974,6 +1990,25 @@ ORDER BY thousand;
1 | 1001
(2 rows)
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan Backward using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ thousand | tenthous
+----------+----------
+ 1 | 1001
+ 0 | 3000
+(2 rows)
+
RESET enable_indexonlyscan;
--
-- Check elimination of constant-NULL subexpressions
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index b95d30f65..25815634c 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -7795,10 +7795,9 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 >= any (array[1,5]);
Merge Cond: (j1.id1 = j2.id1)
Join Filter: (j2.id2 = j1.id2)
-> Index Scan using j1_id1_idx on j1
- -> Index Only Scan using j2_pkey on j2
+ -> Index Scan using j2_id1_idx on j2
Index Cond: (id1 >= ANY ('{1,5}'::integer[]))
- Filter: ((id1 % 1000) = 1)
-(7 rows)
+(6 rows)
select * from j1
inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index d49ce9f30..41b955a27 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -753,7 +753,7 @@ SELECT count(*) FROM dupindexcols
WHERE f1 BETWEEN 'WA' AND 'ZZZ' and id < 1000 and f1 ~<~ 'YX';
--
--- Check ordering of =ANY indexqual results (bug in 9.2.0)
+-- Check that index scans with =ANY indexquals return rows in index order
--
explain (costs off)
@@ -774,6 +774,15 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
SET enable_indexonlyscan = OFF;
explain (costs off)
@@ -785,6 +794,15 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
RESET enable_indexonlyscan;
--
--
2.42.0
On Sun, Oct 15, 2023 at 1:50 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v4, which applies cleanly on top of HEAD. This was needed
due to Alexandar Korotkov's commit e0b1ee17, "Skip checking of scan
keys required for directional scan in B-tree".Unfortunately I have more or less dealt with the conflicts on HEAD by
disabling the optimization from that commit, for the time being.
Attached is v5, which deals with the conflict with the optimization
added by Alexandar Korotkov's commit e0b1ee17 sensibly: the
optimization is now only disabled in cases without array scan keys.
(It'd be very hard to make it work with array scan keys, since an
important principle for my patch is that we can change search-type
scan keys right in the middle of any _bt_readpage() call).
v5 also fixes a longstanding open item for the patch: we no longer
call _bt_preprocess_keys() with a buffer lock held, which was a bad
idea at best, and unsafe (due to the syscache lookups within
_bt_preprocess_keys) at worst. A new, minimal version of the function
(called _bt_preprocess_keys_leafbuf) is called at the same point
instead. That change, combined with the array binary search stuff
(which was added back in v2), makes the total amount of work performed
with a buffer lock held totally reasonable in all cases. It's even
okay in extreme or adversarial cases with many millions of array keys.
Making this _bt_preprocess_keys_leafbuf approach work has a downside:
it requires that _bt_preprocess_keys be a little less aggressive about
removing redundant scan keys, in order to meet certain assumptions
held by the new _bt_preprocess_keys_leafbuf function. Essentially,
_bt_preprocess_keys must now worry about current and future array key
values when determining redundancy among scan keys -- not just the
current array key values. _bt_preprocess_keys knows nothing about
SK_SEARCHARRAY scan keys on HEAD, because on HEAD there is a strict
1:1 correspondence between the number of primitive index scans and the
number of array keys (actually, the number of distinct combinations of
array keys). Obviously that's no longer the case with the patch
(that's the whole point of the patch).
It's easiest to understand how elimination of redundant quals needs to
work in v5 by way of an example. Consider the following query:
select count(*), two, four, twenty, hundred
from
tenk1
where
two in (0, 1) and four in (1, 2, 3)
and two < 1;
Notice that "two" appears in the where clause twice. First it appears
as an SAOP, and then as an inequality. Right now, on HEAD, the
primitive index scan where the SAOP's scankey is "two = 0" renders
"two < 1" redundant. However, the subsequent primitive index scan
where "two = 1" does *not* render "two < 1" redundant. This has
implications for the mechanism in the patch, since the patch will
perform one big primitive index scan for all array constants, with
only a single _bt_preprocess_keys call at the start of its one and
only _bt_first call (but with multiple _bt_preprocess_keys_leafbuf
calls once we reach the leaf level).
The compromise that I've settled on in v5 is to teach
_bt_preprocess_keys to *never* treat "two < 1" as redundant with such
a query -- even though there is some squishy sense in which "two < 1"
is indeed still redundant (for the first SAOP key of value 0). My
approach is reasonably well targeted in that it mostly doesn't affect
queries that don't need it. But it will add cycles to some badly
written queries that wouldn't have had them in earlier Postgres
versions. I'm not entirely sure how much this matters, but my current
sense is that it doesn't matter all that much. This is the kind of
thing that is hard to test and poorly tested, so simplicity is even
more of a virtue than usual.
Note that the changes to _bt_preprocess_keys in v5 *don't* affect how
we determine if the scan has contradictory quals, which is generally
more important. With contradictory quals, _bt_first can avoid reading
any data from the index. OTOH eliminating redundant quals (i.e. the
thing that v5 *does* change) merely makes evaluating index quals less
expensive via preprocessing-away unneeded scan keys. In other words,
while it's possible that the approach taken by v5 will add CPU cycles
in a small number of cases, it should never result in more page
accesses.
--
Peter Geoghegan
Attachments:
v5-0001-Enhance-nbtree-ScalarArrayOp-execution.patchapplication/octet-stream; name=v5-0001-Enhance-nbtree-ScalarArrayOp-execution.patchDownload
From 9e09dd71c0981048d70cce80e7b211844c1b755f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 17 Jun 2023 17:03:36 -0700
Subject: [PATCH v5] Enhance nbtree ScalarArrayOp execution.
Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively. This works by pushing additional context about the arrays
down into the nbtree index AM, as index quals. This information enabled
nbtree to execute multiple primitive index scans as part of an index
scan executor node that was treated as one continuous index scan.
The motivation behind this earlier work was enabling index-only scans
with ScalarArrayOpExpr clauses (SAOP quals are traditionally executed
via BitmapOr nodes, which is largely index-AM-agnostic, but always
requires heap access). The general idea of giving the index AM this
additional context can be pushed a lot further, though.
Teach nbtree SAOP index scans to dynamically advance array scan keys
using information about the characteristics of the index, determined at
runtime. The array key state machine advances the current array keys
using the next index tuple in line to be scanned, at the point where the
scan reaches the end of the last set of array keys. This approach is
far more flexible, and can be far more efficient. Cases that previously
required hundreds (even thousands) of primitive index scans now require
as few as one single primitive index scan.
Also remove all restrictions on generating path keys for nbtree index
scans that happen to have ScalarArrayOpExpr quals. Bugfix commit
807a40c5 taught the planner to avoid generating unsafe path keys: path
keys on a multicolumn index path, with a SAOP clause on any attribute
beyond the first/most significant attribute. These cases are now safe.
Now nbtree index scans with an inequality clause on a high order column
and a SAOP clause on a lower order column are executed as one single
primitive index scan, since that is the most efficient way to do it.
Non-required equality type SAOP quals are executed by nbtree using
almost the same approach used for required equality type SAOP quals.
We now have strong guarantees about the worst case, which is very useful
when costing index scans with SAOP clauses. The cost profile of index
paths with multiple SAOP clauses is now a lot closer to other cases;
more selective index scans will now generally have lower costs than less
selective index scans. The added cost from repeatedly descending the
index still matters, but it can never be completely dominant.
Many of the queries sped up by the work from this commit don't directly
benefit from the nbtree/executor enhancements. They benefit indirectly.
In general it is better to use true index quals instead of filter quals,
since it avoids extra heap accesses when eliminating non-matching tuples
via expression evaluation (in general expression evaluation is only safe
with tuples that are known visible). The nbtree work removes what was
really an artificial downside for index quals, leaving no reason for the
planner to even consider SAOP clause index filter quals anymore. This
is especially likely to help with selective index scans with SAOP
clauses on low-order index columns.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
---
src/include/access/nbtree.h | 39 +-
src/backend/access/nbtree/nbtree.c | 59 +-
src/backend/access/nbtree/nbtsearch.c | 84 +-
src/backend/access/nbtree/nbtutils.c | 1342 ++++++++++++++++++--
src/backend/optimizer/path/indxpath.c | 64 +-
src/backend/utils/adt/selfuncs.c | 123 +-
doc/src/sgml/monitoring.sgml | 13 +
src/test/regress/expected/create_index.out | 61 +-
src/test/regress/expected/join.out | 5 +-
src/test/regress/sql/create_index.sql | 20 +-
10 files changed, 1484 insertions(+), 326 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7bfbf3086..de7dea41c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1043,13 +1043,13 @@ typedef struct BTScanOpaqueData
/* workspace for SK_SEARCHARRAY support */
ScanKey arrayKeyData; /* modified copy of scan->keyData */
- bool arraysStarted; /* Started array keys, but have yet to "reach
- * past the end" of all arrays? */
int numArrayKeys; /* number of equality-type array keys (-1 if
* there are any unsatisfiable array keys) */
- int arrayKeyCount; /* count indicating number of array scan keys
- * processed */
+ bool needPrimScan; /* Perform another primitive scan? */
BTArrayKeyInfo *arrayKeys; /* info about each equality-type array key */
+ FmgrInfo *orderProcs; /* ORDER procs for equality constraint keys */
+ int numPrimScans; /* Running tally of # primitive index scans
+ * (used to coordinate parallel workers) */
MemoryContext arrayContext; /* scan-lifespan context for array data */
/* info about killed items if any (killedItems is NULL if never used) */
@@ -1083,6 +1083,29 @@ typedef struct BTScanOpaqueData
typedef BTScanOpaqueData *BTScanOpaque;
+/*
+ * _bt_readpage state used across _bt_checkkeys calls for a page
+ *
+ * When _bt_readpage is called during a forward scan that has one or more
+ * equality-type SK_SEARCHARRAY scan keys, it has an extra responsibility: to
+ * set up information about the page high key. This must happen before the
+ * first call to _bt_checkkeys. _bt_checkkeys uses this information to manage
+ * advancement of the scan's array keys.
+ */
+typedef struct BTReadPageState
+{
+ /* Input parameters, set by _bt_readpage */
+ ScanDirection dir; /* current scan direction */
+ IndexTuple highkey; /* page high key, set by forward scans */
+
+ /* Output parameters, set by _bt_checkkeys */
+ bool continuescan; /* Terminate ongoing (primitive) index scan? */
+
+ /* Private _bt_checkkeys-managed state */
+ bool highkeychecked; /* high key checked against current
+ * SK_SEARCHARRAY array keys? */
+} BTReadPageState;
+
/*
* We use some private sk_flags bits in preprocessed scan keys. We're allowed
* to use bits 16-31 (see skey.h). The uppermost bits are copied from the
@@ -1160,7 +1183,7 @@ extern bool btcanreturn(Relation index, int attno);
extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno);
extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
extern void _bt_parallel_done(IndexScanDesc scan);
-extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
+extern void _bt_parallel_next_primitive_scan(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
@@ -1253,12 +1276,12 @@ extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan,
+extern bool _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool finaltup,
bool requiredMatchedByPrecheck);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 92950b377..f963c3fe7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -48,8 +48,8 @@
* BTPARALLEL_IDLE indicates that no backend is currently advancing the scan
* to a new page; some process can start doing that.
*
- * BTPARALLEL_DONE indicates that the scan is complete (including error exit).
- * We reach this state once for every distinct combination of array keys.
+ * BTPARALLEL_DONE indicates that the primitive index scan is complete
+ * (including error exit). Reached once per primitive index scan.
*/
typedef enum
{
@@ -69,8 +69,8 @@ typedef struct BTParallelScanDescData
BTPS_State btps_pageStatus; /* indicates whether next page is
* available for scan. see above for
* possible states of parallel scan. */
- int btps_arrayKeyCount; /* count indicating number of array scan
- * keys processed by parallel scan */
+ int btps_numPrimScans; /* count indicating number of primitive
+ * index scans (used with array keys) */
slock_t btps_mutex; /* protects above variables */
ConditionVariable btps_cv; /* used to synchronize parallel scan */
} BTParallelScanDescData;
@@ -276,7 +276,7 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
if (res)
break;
/* ... otherwise see if we have more array keys to deal with */
- } while (so->numArrayKeys && _bt_advance_array_keys(scan, dir));
+ } while (so->numArrayKeys && _bt_array_keys_remain(scan, dir));
return res;
}
@@ -334,7 +334,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
}
}
/* Now see if we have more array keys to deal with */
- } while (so->numArrayKeys && _bt_advance_array_keys(scan, ForwardScanDirection));
+ } while (so->numArrayKeys && _bt_array_keys_remain(scan, ForwardScanDirection));
return ntids;
}
@@ -364,9 +364,10 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->keyData = NULL;
so->arrayKeyData = NULL; /* assume no array keys for now */
- so->arraysStarted = false;
so->numArrayKeys = 0;
+ so->needPrimScan = false;
so->arrayKeys = NULL;
+ so->orderProcs = NULL;
so->arrayContext = NULL;
so->killedItems = NULL; /* until needed */
@@ -406,7 +407,8 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
}
so->markItemIndex = -1;
- so->arrayKeyCount = 0;
+ so->needPrimScan = false;
+ so->numPrimScans = 0;
so->firstPage = false;
BTScanPosUnpinIfPinned(so->markPos);
BTScanPosInvalidate(so->markPos);
@@ -588,7 +590,7 @@ btinitparallelscan(void *target)
SpinLockInit(&bt_target->btps_mutex);
bt_target->btps_scanPage = InvalidBlockNumber;
bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- bt_target->btps_arrayKeyCount = 0;
+ bt_target->btps_numPrimScans = 0;
ConditionVariableInit(&bt_target->btps_cv);
}
@@ -614,7 +616,7 @@ btparallelrescan(IndexScanDesc scan)
SpinLockAcquire(&btscan->btps_mutex);
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- btscan->btps_arrayKeyCount = 0;
+ btscan->btps_numPrimScans = 0;
SpinLockRelease(&btscan->btps_mutex);
}
@@ -625,7 +627,11 @@ btparallelrescan(IndexScanDesc scan)
*
* The return value is true if we successfully seized the scan and false
* if we did not. The latter case occurs if no pages remain for the current
- * set of scankeys.
+ * primitive index scan.
+ *
+ * When array scan keys are in use, each worker process independently advances
+ * its array keys. It's crucial that each worker process never be allowed to
+ * scan a page from before the current scan position.
*
* If the return value is true, *pageno returns the next or current page
* of the scan (depending on the scan direction). An invalid block number
@@ -656,16 +662,17 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
SpinLockAcquire(&btscan->btps_mutex);
pageStatus = btscan->btps_pageStatus;
- if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+ if (so->numPrimScans < btscan->btps_numPrimScans)
{
- /* Parallel scan has already advanced to a new set of scankeys. */
+ /* Top-level scan already moved on to next primitive index scan */
status = false;
}
else if (pageStatus == BTPARALLEL_DONE)
{
/*
- * We're done with this set of scankeys. This may be the end, or
- * there could be more sets to try.
+ * We're done with this primitive index scan. This might have
+ * been the final primitive index scan required, or the top-level
+ * index scan might require additional primitive scans.
*/
status = false;
}
@@ -697,9 +704,12 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
void
_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
{
+ BTScanOpaque so PG_USED_FOR_ASSERTS_ONLY = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
+ Assert(!so->needPrimScan);
+
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
@@ -733,12 +743,11 @@ _bt_parallel_done(IndexScanDesc scan)
parallel_scan->ps_offset);
/*
- * Mark the parallel scan as done for this combination of scan keys,
- * unless some other process already did so. See also
- * _bt_advance_array_keys.
+ * Mark the primitive index scan as done, unless some other process
+ * already did so. See also _bt_array_keys_remain.
*/
SpinLockAcquire(&btscan->btps_mutex);
- if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
+ if (so->numPrimScans >= btscan->btps_numPrimScans &&
btscan->btps_pageStatus != BTPARALLEL_DONE)
{
btscan->btps_pageStatus = BTPARALLEL_DONE;
@@ -752,14 +761,14 @@ _bt_parallel_done(IndexScanDesc scan)
}
/*
- * _bt_parallel_advance_array_keys() -- Advances the parallel scan for array
- * keys.
+ * _bt_parallel_next_primitive_scan() -- Advances parallel primitive scan
+ * counter when array keys are in use.
*
- * Updates the count of array keys processed for both local and parallel
+ * Updates the count of primitive index scans for both local and parallel
* scans.
*/
void
-_bt_parallel_advance_array_keys(IndexScanDesc scan)
+_bt_parallel_next_primitive_scan(IndexScanDesc scan)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
@@ -768,13 +777,13 @@ _bt_parallel_advance_array_keys(IndexScanDesc scan)
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
- so->arrayKeyCount++;
+ so->numPrimScans++;
SpinLockAcquire(&btscan->btps_mutex);
if (btscan->btps_pageStatus == BTPARALLEL_DONE)
{
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- btscan->btps_arrayKeyCount++;
+ btscan->btps_numPrimScans++;
}
SpinLockRelease(&btscan->btps_mutex);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index efc5284e5..d0abde584 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -893,7 +893,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
if (!so->qual_ok)
{
- /* Notify any other workers that we're done with this scan key. */
+ /* Notify any other workers that this primitive scan is done */
_bt_parallel_done(scan);
return false;
}
@@ -952,6 +952,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* one we use --- by definition, they are either redundant or
* contradictory.
*
+ * When SK_SEARCHARRAY keys are in use, _bt_tuple_before_array_keys is
+ * used to avoid prematurely stopping the scan when an array equality qual
+ * has its array keys advanced.
+ *
* Any regular (not SK_SEARCHNULL) key implies a NOT NULL qualifier.
* If the index stores nulls at the end of the index we'll be starting
* from, and we have no boundary key for the column (which means the key
@@ -1537,9 +1541,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
BTPageOpaque opaque;
OffsetNumber minoff;
OffsetNumber maxoff;
+ BTReadPageState pstate;
int itemIndex;
- bool continuescan;
- int indnatts;
bool requiredMatchedByPrecheck;
/*
@@ -1560,8 +1563,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
}
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ pstate.dir = dir;
+ pstate.highkey = NULL;
+ pstate.continuescan = true; /* default assumption */
+ pstate.highkeychecked = false;
+
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
@@ -1609,9 +1615,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* the last item on the page would give a more precise answer.
*
* We skip this for the first page in the scan to evade the possible
- * slowdown of the point queries.
+ * slowdown of the point queries. Do the same with scans with array keys,
+ * since that makes the optimization unsafe (our search-type scan keys can
+ * change during any call to _bt_checkkeys whenever array keys are used).
*/
- if (!so->firstPage && minoff < maxoff)
+ if (!so->firstPage && minoff < maxoff && !so->numArrayKeys)
{
ItemId iid;
IndexTuple itup;
@@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* set flag to true if all required keys are satisfied and false
* otherwise.
*/
- (void) _bt_checkkeys(scan, itup, indnatts, dir,
- &requiredMatchedByPrecheck, false);
+ _bt_checkkeys(scan, &pstate, itup, false, false);
+ requiredMatchedByPrecheck = pstate.continuescan;
+ pstate.continuescan = true; /* reset */
}
else
{
@@ -1636,6 +1645,14 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (ScanDirectionIsForward(dir))
{
+ /* SK_SEARCHARRAY scans must provide high key up front */
+ if (so->numArrayKeys && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+
+ pstate.highkey = (IndexTuple) PageGetItem(page, iid);
+ }
+
/* load items[] in ascending order */
itemIndex = 0;
@@ -1659,8 +1676,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, iid);
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan, requiredMatchedByPrecheck);
+ passes_quals = _bt_checkkeys(scan, &pstate, itup, false,
+ requiredMatchedByPrecheck);
/*
* If the result of prechecking required keys was true, then in
@@ -1668,8 +1685,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* result is the same.
*/
Assert(!requiredMatchedByPrecheck ||
- passes_quals == _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan, false));
+ passes_quals == _bt_checkkeys(scan, &pstate, itup, false,
+ false));
if (passes_quals)
{
/* tuple passes all scan key conditions */
@@ -1703,7 +1720,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
}
}
/* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
+ if (!pstate.continuescan)
break;
offnum = OffsetNumberNext(offnum);
@@ -1720,17 +1737,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* only appear on non-pivot tuples on the right sibling page are
* common.
*/
- if (continuescan && !P_RIGHTMOST(opaque))
+ if (pstate.continuescan && !P_RIGHTMOST(opaque))
{
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
+ IndexTuple itup;
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan, false);
+ if (pstate.highkey)
+ itup = pstate.highkey;
+ else
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+ }
+
+ _bt_checkkeys(scan, &pstate, itup, true, false);
}
- if (!continuescan)
+ if (!pstate.continuescan)
so->currPos.moreRight = false;
Assert(itemIndex <= MaxTIDsPerBTreePage);
@@ -1751,6 +1774,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
IndexTuple itup;
bool tuple_alive;
bool passes_quals;
+ bool finaltup = (offnum == minoff);
/*
* If the scan specifies not to return killed tuples, then we
@@ -1761,12 +1785,18 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* tuple on the page, we do check the index keys, to prevent
* uselessly advancing to the page to the left. This is similar
* to the high key optimization used by forward scans.
+ *
+ * Separately, _bt_checkkeys actually requires that we call it
+ * with the final non-pivot tuple from the page, if there's one
+ * (final processed tuple, or first tuple in offset number terms).
+ * We must indicate which particular tuple comes last, too.
*/
if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
{
Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
+ if (!finaltup)
{
+ Assert(offnum > minoff);
offnum = OffsetNumberPrev(offnum);
continue;
}
@@ -1778,8 +1808,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, iid);
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan, requiredMatchedByPrecheck);
+ passes_quals = _bt_checkkeys(scan, &pstate, itup, finaltup,
+ requiredMatchedByPrecheck);
/*
* If the result of prechecking required keys was true, then in
@@ -1787,8 +1817,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* result is the same.
*/
Assert(!requiredMatchedByPrecheck ||
- passes_quals == _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan, false));
+ passes_quals == _bt_checkkeys(scan, &pstate, itup,
+ finaltup, false));
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions */
@@ -1827,7 +1857,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
}
}
}
- if (!continuescan)
+ if (!pstate.continuescan)
{
/* there can't be any more matches, so stop */
so->currPos.moreLeft = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1510b97fb..7adf76e12 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -33,7 +33,7 @@
typedef struct BTSortArrayContext
{
- FmgrInfo flinfo;
+ FmgrInfo *orderproc;
Oid collation;
bool reverse;
} BTSortArrayContext;
@@ -41,15 +41,35 @@ typedef struct BTSortArrayContext
static Datum _bt_find_extreme_element(IndexScanDesc scan, ScanKey skey,
StrategyNumber strat,
Datum *elems, int nelems);
+static void _bt_sort_cmp_func_setup(IndexScanDesc scan, ScanKey skey);
static int _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
bool reverse,
Datum *elems, int nelems);
static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
+static inline int32 _bt_compare_array_skey(ScanKey cur, FmgrInfo *orderproc,
+ Datum datum, bool null,
+ Datum arrdatum);
+static int _bt_binsrch_array_skey(ScanDirection dir, bool cur_elem_start,
+ BTArrayKeyInfo *array, ScanKey cur,
+ FmgrInfo *orderproc, Datum datum, bool null,
+ int32 *final_result);
+static bool _bt_tuple_before_array_skeys(IndexScanDesc scan,
+ BTReadPageState *pstate,
+ IndexTuple tuple);
+static bool _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool skrequiredtrigger);
+static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir);
+static void _bt_advance_array_keys_to_end(IndexScanDesc scan, ScanDirection dir);
+static void _bt_preprocess_keys_leafbuf(IndexScanDesc scan);
static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
ScanKey leftarg, ScanKey rightarg,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
+static bool _bt_check_compare(ScanDirection dir, BTScanOpaque so,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ bool *continuescan, bool *skrequiredtrigger,
+ bool requiredMatchedByPrecheck);
static bool _bt_check_rowcompare(ScanKey skey,
IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
ScanDirection dir, bool *continuescan);
@@ -202,6 +222,11 @@ _bt_freestack(BTStack stack)
* array keys, it's sufficient to find the extreme element value and replace
* the whole array with that scalar value.
*
+ * In the worst case, the number of primitive index scans will equal the
+ * number of array elements (or the product of the number of array keys when
+ * there are multiple arrays/columns involved). It's also possible that the
+ * total number of primitive index scans will be far less than that.
+ *
* Note: the reason we need so->arrayKeyData, rather than just scribbling
* on scan->keyData, is that callers are permitted to call btrescan without
* supplying a new set of scankey data.
@@ -212,6 +237,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
int numberOfKeys = scan->numberOfKeys;
int16 *indoption = scan->indexRelation->rd_indoption;
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(scan->indexRelation);
int numArrayKeys;
ScanKey cur;
int i;
@@ -265,6 +291,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
/* Allocate space for per-array data in the workspace context */
so->arrayKeys = (BTArrayKeyInfo *) palloc0(numArrayKeys * sizeof(BTArrayKeyInfo));
+ so->orderProcs = (FmgrInfo *) palloc0(nkeyatts * sizeof(FmgrInfo));
/* Now process each array key */
numArrayKeys = 0;
@@ -281,6 +308,16 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
int j;
cur = &so->arrayKeyData[i];
+
+ /*
+ * Attributes with equality-type scan keys (including but not limited
+ * to array scan keys) will need a 3-way comparison function. Set
+ * that up now. (Avoids repeating work for the same attribute.)
+ */
+ if (cur->sk_strategy == BTEqualStrategyNumber &&
+ !OidIsValid(so->orderProcs[cur->sk_attno - 1].fn_oid))
+ _bt_sort_cmp_func_setup(scan, cur);
+
if (!(cur->sk_flags & SK_SEARCHARRAY))
continue;
@@ -436,6 +473,42 @@ _bt_find_extreme_element(IndexScanDesc scan, ScanKey skey,
return result;
}
+/*
+ * Look up the appropriate comparison function in the opfamily.
+ *
+ * Note: it's possible that this would fail, if the opfamily is incomplete,
+ * but it seems quite unlikely that an opfamily would omit non-cross-type
+ * support functions for any datatype that it supports at all.
+ */
+static void
+_bt_sort_cmp_func_setup(IndexScanDesc scan, ScanKey skey)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ Oid elemtype;
+ RegProcedure cmp_proc;
+ FmgrInfo *orderproc = &so->orderProcs[skey->sk_attno - 1];
+
+ /*
+ * Determine the nominal datatype of the array elements. We have to
+ * support the convention that sk_subtype == InvalidOid means the opclass
+ * input type; this is a hack to simplify life for ScanKeyInit().
+ */
+ elemtype = skey->sk_subtype;
+ if (elemtype == InvalidOid)
+ elemtype = rel->rd_opcintype[skey->sk_attno - 1];
+
+ cmp_proc = get_opfamily_proc(rel->rd_opfamily[skey->sk_attno - 1],
+ rel->rd_opcintype[skey->sk_attno - 1],
+ elemtype,
+ BTORDER_PROC);
+ if (!RegProcedureIsValid(cmp_proc))
+ elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
+ BTORDER_PROC, elemtype, elemtype,
+ rel->rd_opfamily[skey->sk_attno - 1]);
+ fmgr_info_cxt(cmp_proc, orderproc, so->arrayContext);
+}
+
/*
* _bt_sort_array_elements() -- sort and de-dup array elements
*
@@ -450,42 +523,14 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
bool reverse,
Datum *elems, int nelems)
{
- Relation rel = scan->indexRelation;
- Oid elemtype;
- RegProcedure cmp_proc;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
BTSortArrayContext cxt;
if (nelems <= 1)
return nelems; /* no work to do */
- /*
- * Determine the nominal datatype of the array elements. We have to
- * support the convention that sk_subtype == InvalidOid means the opclass
- * input type; this is a hack to simplify life for ScanKeyInit().
- */
- elemtype = skey->sk_subtype;
- if (elemtype == InvalidOid)
- elemtype = rel->rd_opcintype[skey->sk_attno - 1];
-
- /*
- * Look up the appropriate comparison function in the opfamily.
- *
- * Note: it's possible that this would fail, if the opfamily is
- * incomplete, but it seems quite unlikely that an opfamily would omit
- * non-cross-type support functions for any datatype that it supports at
- * all.
- */
- cmp_proc = get_opfamily_proc(rel->rd_opfamily[skey->sk_attno - 1],
- elemtype,
- elemtype,
- BTORDER_PROC);
- if (!RegProcedureIsValid(cmp_proc))
- elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
- BTORDER_PROC, elemtype, elemtype,
- rel->rd_opfamily[skey->sk_attno - 1]);
-
/* Sort the array elements */
- fmgr_info(cmp_proc, &cxt.flinfo);
+ cxt.orderproc = &so->orderProcs[skey->sk_attno - 1];
cxt.collation = skey->sk_collation;
cxt.reverse = reverse;
qsort_arg(elems, nelems, sizeof(Datum),
@@ -507,7 +552,7 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
BTSortArrayContext *cxt = (BTSortArrayContext *) arg;
int32 compare;
- compare = DatumGetInt32(FunctionCall2Coll(&cxt->flinfo,
+ compare = DatumGetInt32(FunctionCall2Coll(cxt->orderproc,
cxt->collation,
da, db));
if (cxt->reverse)
@@ -515,6 +560,167 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
return compare;
}
+/*
+ * Comparator uses to search for the next array element when array keys need
+ * to be advanced via one or more binary searches
+ *
+ * This code is loosely based on _bt_compare. However, there are some
+ * important differences.
+ *
+ * It is convenient to think of calling _bt_compare as comparing caller's
+ * insertion scankey to an index tuple. But our callers are not searching
+ * through the index at all -- they're searching through a local array of
+ * datums associated with a scan key (using values they've taken from an index
+ * tuple). This is a complete reversal of how things usually work, which can
+ * be confusing.
+ *
+ * Callers of this function should think of it as comparing "datum" (as well
+ * as "null") to "arrdatum". This is the same approach that _bt_compare takes
+ * in that both functions compare the value that they're searching for to one
+ * particular item used as a binary search pivot. (But it's the wrong way
+ * around if you think of it as "tuple values vs scan key values". So don't.)
+*/
+static inline int32
+_bt_compare_array_skey(ScanKey cur,
+ FmgrInfo *orderproc,
+ Datum datum,
+ bool null,
+ Datum arrdatum)
+{
+ int32 result = 0;
+
+ Assert(cur->sk_strategy == BTEqualStrategyNumber);
+
+ if (cur->sk_flags & SK_ISNULL) /* array/scan key is NULL */
+ {
+ if (null)
+ result = 0; /* NULL "=" NULL */
+ else if (cur->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NULL "<" NOT_NULL */
+ else
+ result = -1; /* NULL ">" NOT_NULL */
+ }
+ else if (null) /* array/scan key is NOT_NULL and tuple item
+ * is NULL */
+ {
+ if (cur->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NOT_NULL ">" NULL */
+ else
+ result = 1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * Like _bt_compare, we need to be careful of cross-type comparisons,
+ * so the left value has to be the value that came from an index
+ * tuple. (Array scan keys cannot be cross-type, but other required
+ * scan keys that use an equal operator can be.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(orderproc, cur->sk_collation,
+ datum, arrdatum));
+
+ /*
+ * Unlike _bt_compare, we flip the sign when column is a DESC column
+ * (and *not* when column is ASC). This matches the approach taken by
+ * _bt_check_rowcompare, which performs similar three-way comparisons.
+ */
+ if (cur->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ return result;
+}
+
+/*
+ * _bt_binsrch_array_skey() -- Binary search for next matching array key
+ *
+ * cur_elem_start indicates if the binary search should begin at the array's
+ * current element (or have the current element as an upper bound if it's a
+ * backward scan). This allows searches against required scan key arrays to
+ * reuse the work of earlier searches, at least in many important cases.
+ * Array keys covering key space that the index scan already processed cannot
+ * possibly contain any matches.
+ *
+ * Returns an index to the first array element >= caller's datum argument.
+ * Also sets *final_result to whatever _bt_compare_array_skey returned when we
+ * directly compared the returned array element to searched-for datum.
+ */
+static int
+_bt_binsrch_array_skey(ScanDirection dir, bool cur_elem_start,
+ BTArrayKeyInfo *array, ScanKey cur,
+ FmgrInfo *orderproc, Datum datum, bool null,
+ int32 *final_result)
+{
+ int low_elem,
+ high_elem,
+ first_elem_dir,
+ result = 0;
+ bool knownequal = false;
+
+ Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(cur->sk_strategy == BTEqualStrategyNumber);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ first_elem_dir = 0;
+ low_elem = array->cur_elem;
+ high_elem = array->num_elems - 1;
+ if (cur_elem_start)
+ low_elem = 0;
+ }
+ else
+ {
+ first_elem_dir = array->num_elems - 1;
+ low_elem = 0;
+ high_elem = array->cur_elem;
+ if (cur_elem_start)
+ {
+ low_elem = 0;
+ high_elem = first_elem_dir;
+ }
+ }
+
+ while (high_elem > low_elem)
+ {
+ int mid_elem = low_elem + ((high_elem - low_elem) / 2);
+ Datum arrdatum = array->elem_values[mid_elem];
+
+ result = _bt_compare_array_skey(cur, orderproc, datum, null, arrdatum);
+
+ if (result == 0)
+ {
+ /*
+ * Each array was deduplicated during initial preprocessing, so
+ * that each element is guaranteed to be unique. We can quit as
+ * soon as we see an equal array, saving ourselves an extra
+ * comparison or two...
+ */
+ low_elem = mid_elem;
+ knownequal = true;
+ break;
+ }
+
+ if (result > 0)
+ low_elem = mid_elem + 1;
+ else
+ high_elem = mid_elem;
+ }
+
+ /*
+ * ...but our caller also cares about the position of the searched-for
+ * datum relative to the low_elem match we'll return. Make sure that we
+ * set *final_result to the result that comes from comparing low_elem's
+ * key value to the datum that caller had us search for.
+ */
+ if (!knownequal)
+ result = _bt_compare_array_skey(cur, orderproc, datum, null,
+ array->elem_values[low_elem]);
+
+ *final_result = result;
+
+ return low_elem;
+}
+
/*
* _bt_start_array_keys() -- Initialize array keys at start of a scan
*
@@ -539,76 +745,6 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
curArrayKey->cur_elem = 0;
skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
}
-
- so->arraysStarted = true;
-}
-
-/*
- * _bt_advance_array_keys() -- Advance to next set of array elements
- *
- * Returns true if there is another set of values to consider, false if not.
- * On true result, the scankeys are initialized with the next set of values.
- */
-bool
-_bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- bool found = false;
- int i;
-
- /*
- * We must advance the last array key most quickly, since it will
- * correspond to the lowest-order index column among the available
- * qualifications. This is necessary to ensure correct ordering of output
- * when there are multiple array keys.
- */
- for (i = so->numArrayKeys - 1; i >= 0; i--)
- {
- BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
- ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key];
- int cur_elem = curArrayKey->cur_elem;
- int num_elems = curArrayKey->num_elems;
-
- if (ScanDirectionIsBackward(dir))
- {
- if (--cur_elem < 0)
- {
- cur_elem = num_elems - 1;
- found = false; /* need to advance next array key */
- }
- else
- found = true;
- }
- else
- {
- if (++cur_elem >= num_elems)
- {
- cur_elem = 0;
- found = false; /* need to advance next array key */
- }
- else
- found = true;
- }
-
- curArrayKey->cur_elem = cur_elem;
- skey->sk_argument = curArrayKey->elem_values[cur_elem];
- if (found)
- break;
- }
-
- /* advance parallel scan */
- if (scan->parallel_scan != NULL)
- _bt_parallel_advance_array_keys(scan);
-
- /*
- * When no new array keys were found, the scan is "past the end" of the
- * array keys. _bt_start_array_keys can still "restart" the array keys if
- * a rescan is required.
- */
- if (!found)
- so->arraysStarted = false;
-
- return found;
}
/*
@@ -661,13 +797,8 @@ _bt_restore_array_keys(IndexScanDesc scan)
* If we changed any keys, we must redo _bt_preprocess_keys. That might
* sound like overkill, but in cases with multiple keys per index column
* it seems necessary to do the full set of pushups.
- *
- * Also do this whenever the scan's set of array keys "wrapped around" at
- * the end of the last primitive index scan. There won't have been a call
- * to _bt_preprocess_keys from some other place following wrap around, so
- * we do it for ourselves.
*/
- if (changed || !so->arraysStarted)
+ if (changed)
{
_bt_preprocess_keys(scan);
/* The mark should have been set on a consistent set of keys... */
@@ -675,6 +806,694 @@ _bt_restore_array_keys(IndexScanDesc scan)
}
}
+/*
+ * Routine to determine if a continuescan=false tuple (set that way by an
+ * initial call to _bt_check_compare) might need to advance the scan's array
+ * keys.
+ *
+ * Returns true when caller passes a tuple that is < the current set of array
+ * keys for the most significant non-equal column/scan key (or > for backwards
+ * scans). This means that it cannot possibly be time to advance the array
+ * keys just yet. _bt_checkkeys caller should suppress its _bt_check_compare
+ * call, and return -- the tuple is treated as not satisfy our indexquals.
+ *
+ * Returns false when caller's tuple is >= the current array keys (or <=, in
+ * the case of backwards scans). This means that it might be time for our
+ * caller to advance the array keys to the next set.
+ *
+ * Note: advancing the array keys may be required when every attribute value
+ * from caller's tuple is equal to corresponding scan key/array datums. See
+ * comments at the start of _bt_advance_array_keys for more.
+ */
+static bool
+_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ ScanDirection dir = pstate->dir;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ bool tuple_before_array_keys = false;
+ ScanKey cur;
+ int ntupatts = BTreeTupleGetNAtts(tuple, rel),
+ ikey;
+
+ Assert(so->qual_ok);
+ Assert(so->numArrayKeys > 0);
+ Assert(so->numberOfKeys > 0);
+ Assert(!so->needPrimScan);
+
+ for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++)
+ {
+ int attnum = cur->sk_attno;
+ FmgrInfo *orderproc;
+ Datum datum;
+ bool null,
+ skrequired;
+ int32 result;
+
+ /*
+ * We only deal with equality strategy scan keys. We leave handling
+ * of inequalities up to _bt_check_compare.
+ */
+ if (cur->sk_strategy != BTEqualStrategyNumber)
+ continue;
+
+ /*
+ * Determine if this scan key is required in the current scan
+ * direction
+ */
+ skrequired = ((ScanDirectionIsForward(dir) &&
+ (cur->sk_flags & SK_BT_REQFWD)) ||
+ (ScanDirectionIsBackward(dir) &&
+ (cur->sk_flags & SK_BT_REQBKWD)));
+
+ /*
+ * Unlike _bt_advance_array_keys, we never deal with any non-required
+ * array keys. Cases where skrequiredtrigger is set to false by
+ * _bt_check_compare should never call here. We are only called after
+ * _bt_check_compare provisionally indicated that the scan should be
+ * terminated due to a _required_ scan key not being satisfied.
+ *
+ * We expect _bt_check_compare to notice and report required scan keys
+ * before non-required ones. _bt_advance_array_keys might still have
+ * to advance non-required array keys in passing for a tuple that we
+ * were called for, but _bt_advance_array_keys doesn't rely on us to
+ * give it advanced notice of that.
+ */
+ if (!skrequired)
+ break;
+
+ if (attnum > ntupatts)
+ {
+ /*
+ * When we reach a high key's truncated attribute, assume that the
+ * tuple attribute's value is >= the scan's search-type scan keys
+ */
+ break;
+ }
+
+ datum = index_getattr(tuple, attnum, itupdesc, &null);
+
+ orderproc = &so->orderProcs[attnum - 1];
+ result = _bt_compare_array_skey(cur, orderproc,
+ datum, null,
+ cur->sk_argument);
+
+ if (result != 0)
+ {
+ if (ScanDirectionIsForward(dir))
+ tuple_before_array_keys = result < 0;
+ else
+ tuple_before_array_keys = result > 0;
+
+ break;
+ }
+ }
+
+ return tuple_before_array_keys;
+}
+
+/*
+ * _bt_array_keys_remain() -- Start another primitive index scan?
+ *
+ * Returns true if _bt_checkkeys determined that another primitive index scan
+ * must take place by calling _bt_first. Otherwise returns false, indicating
+ * that caller's top-level scan is now past the point where further matching
+ * index tuples can be found (for the current scan direction).
+ *
+ * Only call here during scans with one or more equality type array scan keys.
+ * All other scans should just call _bt_first once, no matter what.
+ *
+ * Top-level index scans executed via multiple primitive index scans must not
+ * fail to output index tuples in the usual order for the index -- just like
+ * any other index scan would. The state machine that manages the scan's
+ * array keys must only start primitive index scans when they cover key space
+ * strictly greater than the key space for tuples that the scan has already
+ * returned (or strictly less in the backwards scan case). Otherwise the scan
+ * could output the same index tuples more than once, or in the wrong order.
+ *
+ * This is managed by limiting the cases that can trigger new primitive index
+ * scans to those involving required array scan keys and/or other required
+ * scan keys that use the equality strategy. In particular, the state machine
+ * must not allow high order required scan keys using an inequality strategy
+ * (which are only required in one scan direction) to directly trigger a new
+ * primitive index scan that advances low order non-required array scan keys.
+ * For example, a query such as "SELECT thousand, tenthous FROM tenk1 WHERE
+ * thousand < 2 AND tenthous IN (1001,3000) ORDER BY thousand" whose execution
+ * involves a scan of an index on "(thousand, tenthous)" must perform no more
+ * than a single primitive index scan. Otherwise we risk outputting tuples in
+ * the wrong order. Array key values for the non-required scan key on the
+ * "tenthous" column must not dictate top-level scan order. Primitive index
+ * scans mustn't scan tuples already scanned by some earlier primitive scan.
+ *
+ * In fact, nbtree makes a stronger guarantee than is strictly necessary here:
+ * it guarantees that the top-level scan won't repeat any leaf page reads.
+ * (Actually, that can still happen when the scan is repositioned, or the scan
+ * direction changes -- but that's just as true with other types of scans.)
+ */
+bool
+_bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ Assert(so->numArrayKeys);
+
+ /*
+ * Array keys are advanced within _bt_checkkeys when the scan reaches the
+ * leaf level (more precisely, they're advanced when the scan reaches the
+ * end of each distinct set of array elements). This process avoids
+ * repeat access to leaf pages (across multiple primitive index scans) by
+ * opportunistically advancing the scan's array keys when it allows the
+ * primitive index scan to find nearby matching tuples (or to eliminate
+ * array keys with no matching tuples from further consideration).
+ *
+ * _bt_checkkeys sets a simple flag variable that we check here. This
+ * tells us if we need to perform another primitive index scan for the
+ * now-current array keys or not. We'll unset the flag once again to
+ * acknowledge having started a new primitive scan (or we'll see that it
+ * isn't set and end the top-level scan right away).
+ *
+ * We cannot rely on _bt_first always reaching _bt_checkkeys here. There
+ * are various scenarios where that won't happen. For example, if the
+ * index is completely empty, then _bt_first won't get as far as calling
+ * _bt_readpage/_bt_checkkeys.
+ *
+ * We also don't expect _bt_checkkeys to be reached when searching for a
+ * non-existent value that happens to be higher than any existing value in
+ * the index. No _bt_checkkeys are expected when _bt_readpage reads the
+ * rightmost page during such a scan -- even a _bt_checkkeys call against
+ * the high key won't happen. There is an analogous issue for backwards
+ * scans that search for a value lower than all existing index tuples.
+ *
+ * We don't actually require special handling for these cases -- we don't
+ * need to be explicitly instructed to _not_ perform another primitive
+ * index scan. This is correct for all of the cases we've listed so far,
+ * which all involve primitive index scans that access pages "near the
+ * boundaries of the key space" (the leftmost page, the rightmost page, or
+ * an imaginary empty leaf root page). If _bt_checkkeys cannot be reached
+ * by a primitive index scan for one set of array keys, it follows that it
+ * also won't be reached for any later set of array keys.
+ *
+ * There is one exception: the case where _bt_first's _bt_preprocess_keys
+ * call determined that the scan's input scan keys can never be satisfied.
+ * That might be true for one set of array keys, but not the next set.
+ */
+ if (!so->qual_ok)
+ {
+ /*
+ * Qual can never be satisfied. Advance our array keys incrementally.
+ */
+ so->needPrimScan = false;
+ if (_bt_advance_array_keys_increment(scan, dir))
+ return true;
+ }
+
+ /* Time for another primitive index scan? */
+ if (so->needPrimScan)
+ {
+ /* Begin primitive index scan */
+ so->needPrimScan = false;
+
+ if (scan->parallel_scan != NULL)
+ _bt_parallel_next_primitive_scan(scan);
+
+ return true;
+ }
+
+ /*
+ * No more primitive index scans. Just terminate the top-level scan.
+ */
+ _bt_advance_array_keys_to_end(scan, dir);
+
+ if (scan->parallel_scan != NULL)
+ _bt_parallel_done(scan);
+
+ return false;
+}
+
+/*
+ * _bt_advance_array_keys() -- Advance array elements using a tuple
+ *
+ * Returns true if all required equality-type scan keys (in particular, those
+ * that are array keys) now have exact matching values to those from tuple.
+ * Returns false when the tuple isn't an exact match in this sense.
+ *
+ * Sets pstate.continuescan for caller when we return false. When we return
+ * true it's up to caller to call _bt_check_compare to recheck the tuple. It
+ * is okay to let the second call set pstate.continuescan=false without
+ * further intervention, since we know that it can only be for a scan key that
+ * is required in one direction.
+ *
+ * When called with skrequiredtrigger, we don't expect to have to advance any
+ * non-required scan keys. We'll always set pstate.continuescan because a
+ * non-required scan key can never terminate the scan.
+ *
+ * Required array keys are always advanced to the highest element >= the
+ * corresponding tuple attribute values for its most significant non-equal
+ * column (or the next lowest set <= the tuple value during backwards scans).
+ * If we reach the end of the array keys for the current scan direction, we
+ * end the top-level index scan.
+ *
+ * _bt_tuple_before_array_skeys is responsible for determining if the current
+ * place in the scan is >= the current array keys (or <= during backward
+ * scans). This must be established first, before calling here.
+ *
+ * Note that we may sometimes need to advance the array keys in spite of the
+ * existing array keys already being an exact match for every corresponding
+ * value from caller's tuple. We fall back on "incrementally" advancing the
+ * array keys in these cases, which involve inequality strategy scan keys.
+ * For example, with a composite index on (a, b) and a qual "WHERE a IN (3,5)
+ * AND b < 42", we'll be called for both "a" keys (i.e. keys 3 and 5) when the
+ * scan reaches tuples where "b >= 42". Even though "a" array keys continue
+ * to have exact matches for tuples "b >= 42" (for both array key groupings),
+ * we will still advance the array for "a" via our fallback on incremental
+ * advancement each time we're called. The first time we're called (when the
+ * scan reaches a tuple >= "(3, 42)"), we advance the array key (from 3 to 5).
+ * This gives our caller the option of starting a new primitive index scan
+ * that quickly locates the start of tuples > "(5, -inf)". The second time
+ * we're called (when the scan reaches a tuple >= "(5, 42)"), we incrementally
+ * advance the keys a second time. This second call ends the top-level scan.
+ *
+ * Note also that we deal with all required equality-type scan keys here; it's
+ * not limited to array scan keys. We need to handle non-array equality cases
+ * here because they're equality constraints for the scan, in the same way
+ * that array scan keys are. We must not suppress cases where a call to
+ * _bt_check_compare sets continuescan=false for a required scan key that uses
+ * the equality strategy (only inequality-type scan keys get that treatment).
+ * We don't want to suppress the scan's termination when it's inappropriate.
+ */
+static bool
+_bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool skrequiredtrigger)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ ScanDirection dir = pstate->dir;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ ScanKey cur;
+ int ikey,
+ arrayidx = 0,
+ ntupatts = BTreeTupleGetNAtts(tuple, rel);
+ bool arrays_advanced = false,
+ all_skrequired_atts_wrapped = skrequiredtrigger,
+ all_atts_equal = true,
+ arrays_done;
+
+ Assert(so->numberOfKeys > 0);
+ Assert(so->numArrayKeys > 0);
+ Assert(so->qual_ok);
+
+ /*
+ * Try to advance array keys via a series of binary searches.
+ *
+ * Loop iterates through the current scankeys (so->keyData, which were
+ * output by _bt_preprocess_keys earlier) and then sets input scan keys
+ * (so->arrayKeyData scan keys) to new array values.
+ */
+ for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++)
+ {
+ BTArrayKeyInfo *array = NULL;
+ ScanKey skeyarray = NULL;
+ FmgrInfo *orderproc;
+ int attnum = cur->sk_attno,
+ first_elem_dir,
+ final_elem_dir,
+ set_elem;
+ Datum datum;
+ bool skrequired,
+ null;
+ int32 result;
+
+ /*
+ * We only deal with equality strategy scan keys. We leave handling
+ * of inequalities up to _bt_check_compare.
+ */
+ if (cur->sk_strategy != BTEqualStrategyNumber)
+ continue;
+
+ /*
+ * Determine if this scan key is required in the current scan
+ * direction
+ */
+ skrequired = ((ScanDirectionIsForward(dir) &&
+ (cur->sk_flags & SK_BT_REQFWD)) ||
+ (ScanDirectionIsBackward(dir) &&
+ (cur->sk_flags & SK_BT_REQBKWD)));
+
+ /*
+ * Optimization: we don't have to advance remaining non-required array
+ * keys when we already know that tuple won't be returned by the scan.
+ *
+ * Deliberately check this both here and after the binary search.
+ */
+ if (!skrequired && !all_atts_equal)
+ break;
+
+ /*
+ * We need to check required non-array scan keys (that use the equal
+ * strategy), as well as required and non-required array scan keys
+ * (also limited to those that use the equal strategy, since array
+ * inequalities degenerate into a simple comparison).
+ *
+ * Perform initial set up for this scan key. If it is backed by an
+ * array then we need to set variables describing the current position
+ * in the array.
+ */
+ orderproc = &so->orderProcs[attnum - 1];
+ first_elem_dir = final_elem_dir = 0; /* keep compiler quiet */
+ if (cur->sk_flags & SK_SEARCHARRAY)
+ {
+ /* Set up array comparison function */
+ Assert(arrayidx < so->numArrayKeys);
+ array = &so->arrayKeys[arrayidx++];
+ skeyarray = &so->arrayKeyData[array->scan_key];
+ Assert(skeyarray->sk_attno == attnum);
+
+ /* Proactively set up state used to handle array wraparound */
+ if (ScanDirectionIsForward(dir))
+ {
+ first_elem_dir = 0;
+ final_elem_dir = array->num_elems - 1;
+ }
+ else
+ {
+ first_elem_dir = array->num_elems - 1;
+ final_elem_dir = 0;
+ }
+ }
+ else if (attnum > ntupatts)
+ {
+ /*
+ * Nothing needs to be done when we have a truncated attribute
+ * (possible when caller's tuple is a page high key) and a
+ * non-array scan key
+ */
+ Assert(ScanDirectionIsForward(dir));
+ continue;
+ }
+
+ /*
+ * Here we perform steps for any required scan keys after the first
+ * non-equal required scan key. The first scan key must have been set
+ * to a value > the value from the tuple back when we dealt with it
+ * (or, for a backwards scan, to a value < the value from the tuple).
+ * That needs to "cascade" to lower-order array scan keys. They must
+ * be set to the first array element for the current scan direction.
+ *
+ * We're still setting the keys to values >= the tuple here -- it just
+ * needs to work for the tuple as a whole. For example, when a tuple
+ * "(a, b) = (42, 5)" advances the array keys on "a" from 40 to 45, we
+ * must also set "b" to whatever the first array element for "b" is.
+ * It would be wrong to allow "b" to be set to a value from the tuple,
+ * since the value is actually from a different part of the key space.
+ *
+ * Also defensively do this with truncated attributes when caller's
+ * tuple is a page high key.
+ */
+ if (array && ((arrays_advanced && !all_atts_equal) ||
+ attnum > ntupatts))
+ {
+ /*
+ * We set the array to the first element (if needed) here, and we
+ * don't unset all_required_atts_wrapped. This array therefore
+ * counts as a wrapped array when we go on to determine if all of
+ * the required arrays have wrapped (after this loop).
+ */
+ if (array->cur_elem != first_elem_dir)
+ {
+ array->cur_elem = first_elem_dir;
+ skeyarray->sk_argument = array->elem_values[first_elem_dir];
+ arrays_advanced = true;
+ }
+
+ continue;
+ }
+
+ /*
+ * Going to compare scan key to corresponding tuple attribute value
+ */
+ datum = index_getattr(tuple, attnum, itupdesc, &null);
+
+ if (!array)
+ {
+ if (!skrequired || !all_atts_equal)
+ continue;
+
+ /*
+ * This is a required non-array scan key that uses the equal
+ * strategy. See header comments for an explanation of why we
+ * need to do this.
+ */
+ result = _bt_compare_array_skey(cur, orderproc, datum, null,
+ cur->sk_argument);
+
+ if (result != 0)
+ {
+ /*
+ * tuple attribute value is > scan key value (or < scan key
+ * value in the backward scan case).
+ */
+ all_atts_equal = false;
+ break;
+ }
+
+ continue;
+ }
+
+ /*
+ * Binary search for an array key >= the tuple value, which we'll then
+ * set as our current array key (or <= the tuple value if this is a
+ * backward scan).
+ *
+ * The binary search excludes array keys that we've already processed
+ * from consideration, except with a non-required scan key's array.
+ * This is not just an optimization -- it's important for correctness.
+ * It is crucial that required array scan keys only have their array
+ * keys advanced in the current scan direction. We need to advance
+ * required array keys in lock step with the index scan.
+ *
+ * Note in particular that arrays_advanced must only be set when the
+ * array is advanced to a key >= the existing key, or <= for a
+ * backwards scan. (Though see notes about wraparound below.)
+ */
+ set_elem = _bt_binsrch_array_skey(dir, (!skrequired || arrays_advanced),
+ array, cur, orderproc, datum, null,
+ &result);
+
+ /*
+ * Maintain the state that tracks whether all attribute from the tuple
+ * are equal to the array keys that we've set as current (or existing
+ * array keys set during earlier calls here).
+ */
+ if (result != 0)
+ all_atts_equal = false;
+
+ /*
+ * Optimization: we don't have to advance remaining non-required array
+ * keys when we already know that tuple won't be returned by the scan.
+ * Quit before setting the array keys to avoid _bt_preprocess_keys.
+ *
+ * Deliberately check this both before and after the binary search.
+ */
+ if (!skrequired && !all_atts_equal)
+ break;
+
+ /*
+ * If the binary search indicates that the key space for this tuple
+ * attribute value is > the key value from the final element in the
+ * array (final for the current scan direction), we handle it by
+ * wrapping around to the first element of the array.
+ *
+ * Wrapping around simplifies advancement with a multi-column index by
+ * allowing us to treat wrapping a column as advancing the column. We
+ * preserve the invariant that a required scan key's array may only be
+ * ratcheted forward (backwards when the scan direction is backwards),
+ * while still always being able to "advance" the array at this point.
+ */
+ if (set_elem == final_elem_dir &&
+ ((ScanDirectionIsForward(dir) && result > 0) ||
+ (ScanDirectionIsBackward(dir) && result < 0)))
+ {
+ /* Perform wraparound */
+ set_elem = first_elem_dir;
+ }
+ else if (skrequired)
+ {
+ /* Won't call _bt_advance_array_keys_to_end later */
+ all_skrequired_atts_wrapped = false;
+ }
+
+ Assert(set_elem >= 0 && set_elem < array->num_elems);
+ if (array->cur_elem != set_elem)
+ {
+ array->cur_elem = set_elem;
+ skeyarray->sk_argument = array->elem_values[set_elem];
+ arrays_advanced = true;
+
+ /*
+ * We shouldn't have to advance a required array when called due
+ * to _bt_check_compare determining that a non-required array
+ * needs to be advanced. We expect _bt_check_compare to notice
+ * and report required scan keys before non-required ones.
+ */
+ Assert(skrequiredtrigger || !skrequired);
+ }
+ }
+
+ /*
+ * Finalize details of array key advancement
+ */
+ arrays_done = false;
+ if (!skrequiredtrigger)
+ {
+ /*
+ * Failing to satisfy a non-required array scan key shouldn't ever
+ * result in terminating the (primitive) index scan
+ */
+ }
+ else if (all_skrequired_atts_wrapped)
+ {
+ /*
+ * The binary searches for each tuple's attribute value in the scan
+ * key's corresponding SK_SEARCHARRAY array all found that the tuple's
+ * value are "past the end" of the key space covered by each array
+ */
+ _bt_advance_array_keys_to_end(scan, dir);
+ arrays_done = true;
+ all_atts_equal = false; /* at least not now */
+ }
+ else if (!arrays_advanced)
+ {
+ /*
+ * We must always advance the array keys by at least one increment
+ * (except when called to advance a non-required scan key's array).
+ *
+ * We need this fallback for cases where the existing array keys and
+ * existing required equal-strategy scan keys were fully equal to the
+ * tuple. _bt_check_compare may have set continuescan=false due to an
+ * inequality terminating the scan, which we don't deal with directly.
+ * (See function's header comments for an example.)
+ */
+ if (_bt_advance_array_keys_increment(scan, dir))
+ arrays_advanced = true;
+ else
+ arrays_done = true;
+ all_atts_equal = false; /* at least not now */
+ }
+
+ /*
+ * If we changed the array keys (without exhausting all array keys), then
+ * we must now perform a targeted form of in-place preprocessing of the
+ * scan's search-type scan keys. This updates the array scan keys in
+ * place. It doesn't try to eliminate redundant keys, nor can it detect
+ * contradictory quals.
+ */
+ if (arrays_advanced && !arrays_done)
+ _bt_preprocess_keys_leafbuf(scan);
+
+ /*
+ * If we haven't yet exhausted all required array scan keys, the primitive
+ * scan continues for now. Note that the !all_atts_equal case will have
+ * another call to _bt_check_compare right away, which will overwrite
+ * continuescan right away.
+ *
+ * If any required array keys changed, it makes sense to check the high
+ * key to terminate the scan early (the fact that it might not have worked
+ * with previous array keys and earlier tuples tells us nothing about what
+ * might work with new array keys and later index tuples).
+ */
+ pstate->continuescan = !arrays_done;
+ if (arrays_advanced && skrequiredtrigger)
+ pstate->highkeychecked = false;
+
+ return all_atts_equal;
+}
+
+/*
+ * Advance the array keys by a single increment in the current scan direction
+ */
+static bool
+_bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool found = false;
+
+ Assert(!so->needPrimScan);
+
+ /*
+ * We must advance the last array key most quickly, since it will
+ * correspond to the lowest-order index column among the available
+ * qualifications. This is necessary to ensure correct ordering of output
+ * when there are multiple array keys.
+ */
+ for (int i = so->numArrayKeys - 1; i >= 0; i--)
+ {
+ BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
+ ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key];
+ int cur_elem = curArrayKey->cur_elem;
+ int num_elems = curArrayKey->num_elems;
+
+ if (ScanDirectionIsBackward(dir))
+ {
+ if (--cur_elem < 0)
+ {
+ cur_elem = num_elems - 1;
+ found = false; /* need to advance next array key */
+ }
+ else
+ found = true;
+ }
+ else
+ {
+ if (++cur_elem >= num_elems)
+ {
+ cur_elem = 0;
+ found = false; /* need to advance next array key */
+ }
+ else
+ found = true;
+ }
+
+ curArrayKey->cur_elem = cur_elem;
+ skey->sk_argument = curArrayKey->elem_values[cur_elem];
+ if (found)
+ break;
+ }
+
+ return found;
+}
+
+/*
+ * Perform final steps when the "end point" is reached on the leaf level
+ * without any call to _bt_checkkeys setting *continuescan to false.
+ */
+static void
+_bt_advance_array_keys_to_end(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ Assert(so->numArrayKeys);
+ Assert(!so->needPrimScan);
+
+ for (int i = 0; i < so->numArrayKeys; i++)
+ {
+ BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
+ ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key];
+ int reset_elem;
+
+ if (ScanDirectionIsForward(dir))
+ reset_elem = curArrayKey->num_elems - 1;
+ else
+ reset_elem = 0;
+
+ if (curArrayKey->cur_elem != reset_elem)
+ {
+ curArrayKey->cur_elem = reset_elem;
+ skey->sk_argument = curArrayKey->elem_values[reset_elem];
+ }
+ }
+}
/*
* _bt_preprocess_keys() -- Preprocess scan keys
@@ -749,6 +1568,21 @@ _bt_restore_array_keys(IndexScanDesc scan)
* Again, missing cross-type operators might cause us to fail to prove the
* quals contradictory when they really are, but the scan will work correctly.
*
+ * Index scans with array keys need to be able to advance each array's keys
+ * and make them the current search-type scan keys without calling here. They
+ * expect to be able to call _bt_preprocess_keys_leafbuf instead (a stripped
+ * down version of this function that's specialized to array key index scans).
+ * We need to be careful about that case here when we determine redundancy;
+ * equality quals must not be eliminated as redundant on the basis of array
+ * input keys that might change before another call here takes place.
+ *
+ * Note, however, that the presence of an array scan key doesn't affect how we
+ * determine if index quals are contradictory. Contradictory qual scans move
+ * on to the next primitive index scan right away, by incrementing the scan's
+ * array keys once control reaches _bt_array_keys_remain. There won't ever be
+ * a call to _bt_preprocess_keys_leafbuf before the next call here, so there
+ * is nothing for us to break.
+ *
* Row comparison keys are currently also treated without any smarts:
* we just transfer them into the preprocessed array without any
* editorialization. We can treat them the same as an ordinary inequality
@@ -895,8 +1729,11 @@ _bt_preprocess_keys(IndexScanDesc scan)
so->qual_ok = false;
return;
}
- /* else discard the redundant non-equality key */
- xform[j] = NULL;
+ else if (!(eq->sk_flags & SK_SEARCHARRAY))
+ {
+ /* else discard the redundant non-equality key */
+ xform[j] = NULL;
+ }
}
/* else, cannot determine redundancy, keep both keys */
}
@@ -994,12 +1831,28 @@ _bt_preprocess_keys(IndexScanDesc scan)
}
else
{
- /* yup, keep only the more restrictive key */
+ /* yup, keep only the more restrictive non-equality key */
if (_bt_compare_scankey_args(scan, cur, cur, xform[j],
&test_result))
{
if (test_result)
- xform[j] = cur;
+ {
+ if (j == (BTEqualStrategyNumber - 1))
+ {
+ /*
+ * Keep redundant = operators so that array scan keys
+ * will always be present, as expected by our sibling
+ * _bt_preprocess_keys_leafbuf function.
+ */
+ ScanKey outkey = &outkeys[new_numberOfKeys++];
+
+ memcpy(outkey, cur, sizeof(ScanKeyData));
+ if (numberOfEqualCols == attno - 1)
+ _bt_mark_scankey_required(outkey);
+ }
+ else
+ xform[j] = cur;
+ }
else if (j == (BTEqualStrategyNumber - 1))
{
/* key == a && key == b, but a != b */
@@ -1027,6 +1880,51 @@ _bt_preprocess_keys(IndexScanDesc scan)
so->numberOfKeys = new_numberOfKeys;
}
+/*
+ * _bt_preprocess_keys_leafbuf() -- Preprocess array scan keys only
+ *
+ * Stripped down version of _bt_preprocess_keys that can be called with a
+ * buffer lock held. Reuses much of the work performed during the previous
+ * _bt_preprocess_keys call.
+ *
+ * This function just transfers newly advanced array keys that were set in
+ * "so->arrayKeyData" to corresponding "so->keyData" search-type scan keys.
+ * It does not independently detect redunant or contradictory scan keys. This
+ * makes little difference in practice -- we rely on _bt_preprocess_keys calls
+ * from _bt_first to get most of the available benefit.
+ */
+static void
+_bt_preprocess_keys_leafbuf(IndexScanDesc scan)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey cur;
+ int ikey,
+ arrayidx = 0;
+
+ for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++)
+ {
+ BTArrayKeyInfo *array;
+ ScanKey skeyarray;
+
+ /* Just update equality array scan keys */
+ if (cur->sk_strategy != BTEqualStrategyNumber ||
+ !(cur->sk_flags & SK_SEARCHARRAY))
+ continue;
+
+ Assert(arrayidx < so->numArrayKeys);
+ array = &so->arrayKeys[arrayidx++];
+ skeyarray = &so->arrayKeyData[array->scan_key];
+
+ /*
+ * Update the scan key's argument, but nothing more
+ */
+ Assert(cur->sk_attno == skeyarray->sk_attno);
+ cur->sk_argument = skeyarray->sk_argument;
+ }
+
+ Assert(arrayidx == so->numArrayKeys);
+}
+
/*
* Compare two scankey values using a specified operator.
*
@@ -1360,41 +2258,209 @@ _bt_mark_scankey_required(ScanKey skey)
*
* Return true if so, false if not. If the tuple fails to pass the qual,
* we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
+ * this tuple, and set pstate.continuescan accordingly. See comments for
* _bt_preprocess_keys(), above, about how this is done.
*
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
+ * Forward scan callers can pass a high key tuple in the hopes of having us
+ * set pstate.continuescan to false, and avoiding an unnecessary visit to the
+ * page to the right.
+ *
+ * Forwards scan callers with equality type array scan keys are obligated to
+ * set up page state in a way that makes it possible for us to check the high
+ * key early, before we've expended too much effort on comparing tuples that
+ * cannot possibly be matches for any set of array keys. This is just an
+ * optimization.
+ *
+ * Advances the current set of array keys for SK_SEARCHARRAY scans where
+ * appropriate. These callers are required to initialize the page level high
+ * key in pstate before the first call here for the page (when the scan
+ * direction is forwards). Note that we rely on _bt_readpage calling here in
+ * page offset number order (for its scan direction). Any other order will
+ * lead to inconsistent array key state.
*
* scan: index scan descriptor (containing a search-type scankey)
+ * pstate: Page level input and output parameters
* tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
+ * finaltup: Is tuple the final one we'll be called with for this page?
* requiredMatchedByPrecheck: indicates that scan keys required for
* direction scan are already matched
*/
bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan,
+_bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool finaltup,
bool requiredMatchedByPrecheck)
{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
+ TupleDesc tupdesc = RelationGetDescr(scan->indexRelation);
+ int natts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool res;
+ bool skrequiredtrigger;
+
+ Assert(so->qual_ok);
+ Assert(pstate->continuescan);
+ Assert(!so->needPrimScan);
+
+ res = _bt_check_compare(pstate->dir, so, tuple, natts, tupdesc,
+ &pstate->continuescan, &skrequiredtrigger,
+ requiredMatchedByPrecheck);
+
+ /*
+ * Only one _bt_check_compare call is required in the common case where
+ * there are no equality-type array scan keys.
+ *
+ * When there are array scan keys then we can still accept the first
+ * answer we get from _bt_check_compare when continuescan wasn't unset.
+ */
+ if (!so->numArrayKeys || pstate->continuescan)
+ return res;
+
+ /*
+ * _bt_check_compare set continuescan=false in the presence of equality
+ * type array keys. It's possible that we haven't reached the start of
+ * the array keys just yet. It's also possible that we need to advance
+ * the array keys now. (Or perhaps we really do need to terminate the
+ * top-level scan.)
+ */
+ pstate->continuescan = true; /* new initial assumption */
+
+ if (skrequiredtrigger && _bt_tuple_before_array_skeys(scan, pstate, tuple))
+ {
+ /*
+ * Tuple is still < the current array scan key values (as well as
+ * other equality type scan keys) if this is a forward scan.
+ * (Backwards scans reach here with a tuple > equality constraints.)
+ * We must now consider how to proceed with the ongoing primitive
+ * index scan.
+ *
+ * Should _bt_readpage continue with this page for now, in the hope of
+ * finding tuples whose key space is covered by the current array keys
+ * before too long? Or, should it give up and start a new primitive
+ * index scan instead?
+ *
+ * Our policy is to terminate the primitive index scan at the end of
+ * the current page if the current (most recently advanced) array keys
+ * don't cover the final tuple from the page. This policy is fairly
+ * conservative.
+ *
+ * Note: In some cases we're effectively speculating that the next
+ * sibling leaf page will have tuples that are covered by the key
+ * space of our array keys (the current set or some nearby set), based
+ * on a cue from the current page's final tuple. There is at least a
+ * non-zero risk of wasting a page access -- we could gamble and lose.
+ * The details of all this are handled within _bt_advance_array_keys.
+ */
+ if (finaltup || (!pstate->highkeychecked && pstate->highkey &&
+ _bt_tuple_before_array_skeys(scan, pstate,
+ pstate->highkey)))
+ {
+ /*
+ * This is the final tuple (the high key for forward scans, or the
+ * tuple at the first offset number for backward scans), but it is
+ * still before the current array keys. As such, we're unwilling
+ * to allow the current primitive index scan to continue to the
+ * next leaf page.
+ *
+ * Start a new primitive index scan. The next primitive index
+ * scan (in the next _bt_first call) is expected to reposition the
+ * scan to some much later leaf page. (If we had a good reason to
+ * think that the next leaf page that will be scanned will turn
+ * out to be close to our current position, then we wouldn't be
+ * starting another primitive index scan.)
+ *
+ * Note: _bt_readpage stashes the page high key, which allows us
+ * to make this check early (for forward scans). We thereby avoid
+ * scanning very many extra tuples on the page. This is just an
+ * optimization; skipping these useless comparisons should never
+ * change our final conclusion about what the scan should do next.
+ */
+ pstate->continuescan = false;
+ so->needPrimScan = true;
+ }
+ else if (!finaltup && pstate->highkey)
+ {
+ /*
+ * Remember that the high key has been checked with this
+ * particular set of array keys.
+ *
+ * It might make sense to check the same high key again at some
+ * point during the ongoing _bt_readpage-wise scan of this page.
+ * But it is definitely wasteful to repeat the same high key check
+ * before the array keys are advanced by some later tuple.
+ */
+ pstate->highkeychecked = true;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual
+ */
+ return false;
+ }
+
+ /*
+ * Caller's tuple is >= the current set of array keys and other equality
+ * constraint scan keys (or <= if this is a backwards scans).
+ *
+ * It might be time to advance the array keys to the next set. Try doing
+ * that now, while determining in passing if the tuple matches the newly
+ * advanced set of array keys (if we've any left).
+ *
+ * This call will also set continuescan for us (or tells us to perform
+ * another _bt_check_compare call, which then sets continuescan for us).
+ */
+ if (!_bt_advance_array_keys(scan, pstate, tuple, skrequiredtrigger))
+ {
+ /*
+ * Tuple doesn't match any later array keys, either. Give up on this
+ * tuple being a match. (Call may have also terminated the primitive
+ * scan, or the top-level scan.)
+ */
+ return false;
+ }
+
+ /*
+ * Advanced array keys to values that are exact matches for corresponding
+ * attribute values from the tuple.
+ *
+ * It's fairly likely that the tuple satisfies all index scan conditions
+ * at this point, but we need confirmation of that. We also need to give
+ * _bt_check_compare a real opportunity to end the top-level index scan by
+ * setting continuescan=false. (_bt_advance_array_keys cannot deal with
+ * inequality strategy scan keys; we need _bt_check_compare for those.)
+ */
+ return _bt_check_compare(pstate->dir, so, tuple, natts, tupdesc,
+ &pstate->continuescan, &skrequiredtrigger,
+ false);
+}
+
+/*
+ * Test whether an indextuple satisfies current scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction to
+ * pass the qual with the current set of array keys.
+ *
+ * This is a subroutine for _bt_checkkeys. It is written with the assumption
+ * that reaching the end of each distinct set of array keys terminates the
+ * ongoing primitive index scan. It is up to our caller (that has more
+ * context than we have available here) to override that initial determination
+ * when it makes more sense to advance the array keys and continue with
+ * further tuples from the same leaf page.
+ */
+static bool
+_bt_check_compare(ScanDirection dir, BTScanOpaque so,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ bool *continuescan, bool *skrequiredtrigger,
+ bool requiredMatchedByPrecheck)
+{
int ikey;
ScanKey key;
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+ Assert(!so->numArrayKeys || !requiredMatchedByPrecheck);
*continuescan = true; /* default assumption */
+ *skrequiredtrigger = true; /* default assumption */
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ for (key = so->keyData, ikey = 0; ikey < so->numberOfKeys; key++, ikey++)
{
Datum datum;
bool isNull;
@@ -1526,7 +2592,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* _bt_first() except for the NULLs checking, which have already done
* above.
*/
- if (!requiredOppositeDir)
+ if (!requiredOppositeDir || so->numArrayKeys)
{
test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
datum, key->sk_argument);
@@ -1549,10 +2615,22 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* qual fails, it is critical that equality quals be used for the
* initial positioning in _bt_first() when they are available. See
* comments in _bt_first().
+ *
+ * Scans with equality-type array scan keys run into a similar
+ * problem whenever they advance the array keys. Our caller uses
+ * _bt_tuple_before_array_skeys to avoid the problem there.
*/
if (requiredSameDir)
*continuescan = false;
+ if ((key->sk_flags & SK_SEARCHARRAY) &&
+ key->sk_strategy == BTEqualStrategyNumber)
+ {
+ if (!requiredSameDir)
+ *skrequiredtrigger = false;
+ *continuescan = false;
+ }
+
/*
* In any case, this indextuple doesn't match the qual.
*/
@@ -1571,7 +2649,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* it's not possible for any future tuples in the current scan direction
* to pass the qual.
*
- * This is a subroutine for _bt_checkkeys, which see for more info.
+ * This is a subroutine for _bt_check_compare/_bt_checkkeys_compare.
*/
static bool
_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 6a93d767a..f04ca1ee9 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -106,8 +106,7 @@ static List *build_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexOptInfo *index, IndexClauseSet *clauses,
bool useful_predicate,
ScanTypeControl scantype,
- bool *skip_nonnative_saop,
- bool *skip_lower_saop);
+ bool *skip_nonnative_saop);
static List *build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel,
List *clauses, List *other_clauses);
static List *generate_bitmap_or_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -706,8 +705,6 @@ eclass_already_used(EquivalenceClass *parent_ec, Relids oldrelids,
* index AM supports them natively, we should just include them in simple
* index paths. If not, we should exclude them while building simple index
* paths, and then make a separate attempt to include them in bitmap paths.
- * Furthermore, we should consider excluding lower-order ScalarArrayOpExpr
- * quals so as to create ordered paths.
*/
static void
get_index_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -716,37 +713,17 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
{
List *indexpaths;
bool skip_nonnative_saop = false;
- bool skip_lower_saop = false;
ListCell *lc;
/*
* Build simple index paths using the clauses. Allow ScalarArrayOpExpr
- * clauses only if the index AM supports them natively, and skip any such
- * clauses for index columns after the first (so that we produce ordered
- * paths if possible).
+ * clauses only if the index AM supports them natively.
*/
indexpaths = build_index_paths(root, rel,
index, clauses,
index->predOK,
ST_ANYSCAN,
- &skip_nonnative_saop,
- &skip_lower_saop);
-
- /*
- * If we skipped any lower-order ScalarArrayOpExprs on an index with an AM
- * that supports them, then try again including those clauses. This will
- * produce paths with more selectivity but no ordering.
- */
- if (skip_lower_saop)
- {
- indexpaths = list_concat(indexpaths,
- build_index_paths(root, rel,
- index, clauses,
- index->predOK,
- ST_ANYSCAN,
- &skip_nonnative_saop,
- NULL));
- }
+ &skip_nonnative_saop);
/*
* Submit all the ones that can form plain IndexScan plans to add_path. (A
@@ -784,7 +761,6 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
index, clauses,
false,
ST_BITMAPSCAN,
- NULL,
NULL);
*bitindexpaths = list_concat(*bitindexpaths, indexpaths);
}
@@ -817,27 +793,19 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
* to true if we found any such clauses (caller must initialize the variable
* to false). If it's NULL, we do not ignore ScalarArrayOpExpr clauses.
*
- * If skip_lower_saop is non-NULL, we ignore ScalarArrayOpExpr clauses for
- * non-first index columns, and we set *skip_lower_saop to true if we found
- * any such clauses (caller must initialize the variable to false). If it's
- * NULL, we do not ignore non-first ScalarArrayOpExpr clauses, but they will
- * result in considering the scan's output to be unordered.
- *
* 'rel' is the index's heap relation
* 'index' is the index for which we want to generate paths
* 'clauses' is the collection of indexable clauses (IndexClause nodes)
* 'useful_predicate' indicates whether the index has a useful predicate
* 'scantype' indicates whether we need plain or bitmap scan support
* 'skip_nonnative_saop' indicates whether to accept SAOP if index AM doesn't
- * 'skip_lower_saop' indicates whether to accept non-first-column SAOP
*/
static List *
build_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexOptInfo *index, IndexClauseSet *clauses,
bool useful_predicate,
ScanTypeControl scantype,
- bool *skip_nonnative_saop,
- bool *skip_lower_saop)
+ bool *skip_nonnative_saop)
{
List *result = NIL;
IndexPath *ipath;
@@ -848,7 +816,6 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
List *orderbyclausecols;
List *index_pathkeys;
List *useful_pathkeys;
- bool found_lower_saop_clause;
bool pathkeys_possibly_useful;
bool index_is_ordered;
bool index_only_scan;
@@ -880,19 +847,11 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
* on by btree and possibly other places.) The list can be empty, if the
* index AM allows that.
*
- * found_lower_saop_clause is set true if we accept a ScalarArrayOpExpr
- * index clause for a non-first index column. This prevents us from
- * assuming that the scan result is ordered. (Actually, the result is
- * still ordered if there are equality constraints for all earlier
- * columns, but it seems too expensive and non-modular for this code to be
- * aware of that refinement.)
- *
* We also build a Relids set showing which outer rels are required by the
* selected clauses. Any lateral_relids are included in that, but not
* otherwise accounted for.
*/
index_clauses = NIL;
- found_lower_saop_clause = false;
outer_relids = bms_copy(rel->lateral_relids);
for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++)
{
@@ -917,16 +876,6 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
/* Caller had better intend this only for bitmap scan */
Assert(scantype == ST_BITMAPSCAN);
}
- if (indexcol > 0)
- {
- if (skip_lower_saop)
- {
- /* Caller doesn't want to lose index ordering */
- *skip_lower_saop = true;
- continue;
- }
- found_lower_saop_clause = true;
- }
}
/* OK to include this clause */
@@ -956,11 +905,9 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
/*
* 2. Compute pathkeys describing index's ordering, if any, then see how
* many of them are actually useful for this query. This is not relevant
- * if we are only trying to build bitmap indexscans, nor if we have to
- * assume the scan is unordered.
+ * if we are only trying to build bitmap indexscans.
*/
pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN &&
- !found_lower_saop_clause &&
has_useful_pathkeys(root, rel));
index_is_ordered = (index->sortopfamily != NULL);
if (index_is_ordered && pathkeys_possibly_useful)
@@ -1212,7 +1159,6 @@ build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel,
index, &clauseset,
useful_predicate,
ST_BITMAPSCAN,
- NULL,
NULL);
result = list_concat(result, indexpaths);
}
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076..c796b53a6 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6444,8 +6444,6 @@ genericcostestimate(PlannerInfo *root,
double numIndexTuples;
double spc_random_page_cost;
double num_sa_scans;
- double num_outer_scans;
- double num_scans;
double qual_op_cost;
double qual_arg_cost;
List *selectivityQuals;
@@ -6460,7 +6458,7 @@ genericcostestimate(PlannerInfo *root,
/*
* Check for ScalarArrayOpExpr index quals, and estimate the number of
- * index scans that will be performed.
+ * primitive index scans that will be performed for caller
*/
num_sa_scans = 1;
foreach(l, indexQuals)
@@ -6490,19 +6488,8 @@ genericcostestimate(PlannerInfo *root,
*/
numIndexTuples = costs->numIndexTuples;
if (numIndexTuples <= 0.0)
- {
numIndexTuples = indexSelectivity * index->rel->tuples;
- /*
- * The above calculation counts all the tuples visited across all
- * scans induced by ScalarArrayOpExpr nodes. We want to consider the
- * average per-indexscan number, so adjust. This is a handy place to
- * round to integer, too. (If caller supplied tuple estimate, it's
- * responsible for handling these considerations.)
- */
- numIndexTuples = rint(numIndexTuples / num_sa_scans);
- }
-
/*
* We can bound the number of tuples by the index size in any case. Also,
* always estimate at least one tuple is touched, even when
@@ -6540,27 +6527,31 @@ genericcostestimate(PlannerInfo *root,
*
* The above calculations are all per-index-scan. However, if we are in a
* nestloop inner scan, we can expect the scan to be repeated (with
- * different search keys) for each row of the outer relation. Likewise,
- * ScalarArrayOpExpr quals result in multiple index scans. This creates
- * the potential for cache effects to reduce the number of disk page
- * fetches needed. We want to estimate the average per-scan I/O cost in
- * the presence of caching.
+ * different search keys) for each row of the outer relation. This
+ * creates the potential for cache effects to reduce the number of disk
+ * page fetches needed. We want to estimate the average per-scan I/O cost
+ * in the presence of caching.
*
* We use the Mackert-Lohman formula (see costsize.c for details) to
* estimate the total number of page fetches that occur. While this
* wasn't what it was designed for, it seems a reasonable model anyway.
* Note that we are counting pages not tuples anymore, so we take N = T =
* index size, as if there were one "tuple" per page.
+ *
+ * Note: we assume that there will be no repeat index page fetches across
+ * ScalarArrayOpExpr primitive scans from the same logical index scan.
+ * This is guaranteed to be true for btree indexes, but is very optimistic
+ * with index AMs that cannot natively execute ScalarArrayOpExpr quals.
+ * However, these same index AMs also accept our default pessimistic
+ * approach to counting num_sa_scans (btree caller caps this), so we don't
+ * expect the final indexTotalCost to be wildly over-optimistic.
*/
- num_outer_scans = loop_count;
- num_scans = num_sa_scans * num_outer_scans;
-
- if (num_scans > 1)
+ if (loop_count > 1)
{
double pages_fetched;
/* total page fetches ignoring cache effects */
- pages_fetched = numIndexPages * num_scans;
+ pages_fetched = numIndexPages * loop_count;
/* use Mackert and Lohman formula to adjust for cache effects */
pages_fetched = index_pages_fetched(pages_fetched,
@@ -6570,11 +6561,9 @@ genericcostestimate(PlannerInfo *root,
/*
* Now compute the total disk access cost, and then report a pro-rated
- * share for each outer scan. (Don't pro-rate for ScalarArrayOpExpr,
- * since that's internal to the indexscan.)
+ * share for each outer scan
*/
- indexTotalCost = (pages_fetched * spc_random_page_cost)
- / num_outer_scans;
+ indexTotalCost = (pages_fetched * spc_random_page_cost) / loop_count;
}
else
{
@@ -6590,10 +6579,8 @@ genericcostestimate(PlannerInfo *root,
* evaluated once at the start of the scan to reduce them to runtime keys
* to pass to the index AM (see nodeIndexscan.c). We model the per-tuple
* CPU costs as cpu_index_tuple_cost plus one cpu_operator_cost per
- * indexqual operator. Because we have numIndexTuples as a per-scan
- * number, we have to multiply by num_sa_scans to get the correct result
- * for ScalarArrayOpExpr cases. Similarly add in costs for any index
- * ORDER BY expressions.
+ * indexqual operator. Similarly add in costs for any index ORDER BY
+ * expressions.
*
* Note: this neglects the possible costs of rechecking lossy operators.
* Detecting that that might be needed seems more expensive than it's
@@ -6606,7 +6593,7 @@ genericcostestimate(PlannerInfo *root,
indexStartupCost = qual_arg_cost;
indexTotalCost += qual_arg_cost;
- indexTotalCost += numIndexTuples * num_sa_scans * (cpu_index_tuple_cost + qual_op_cost);
+ indexTotalCost += numIndexTuples * (cpu_index_tuple_cost + qual_op_cost);
/*
* Generic assumption about index correlation: there isn't any.
@@ -6684,7 +6671,6 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
bool eqQualHere;
bool found_saop;
bool found_is_null_op;
- double num_sa_scans;
ListCell *lc;
/*
@@ -6699,17 +6685,12 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
*
* For a RowCompareExpr, we consider only the first column, just as
* rowcomparesel() does.
- *
- * If there's a ScalarArrayOpExpr in the quals, we'll actually perform N
- * index scans not one, but the ScalarArrayOpExpr's operator can be
- * considered to act the same as it normally does.
*/
indexBoundQuals = NIL;
indexcol = 0;
eqQualHere = false;
found_saop = false;
found_is_null_op = false;
- num_sa_scans = 1;
foreach(lc, path->indexclauses)
{
IndexClause *iclause = lfirst_node(IndexClause, lc);
@@ -6749,14 +6730,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
else if (IsA(clause, ScalarArrayOpExpr))
{
ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause;
- Node *other_operand = (Node *) lsecond(saop->args);
- int alength = estimate_array_length(other_operand);
clause_op = saop->opno;
found_saop = true;
- /* count number of SA scans induced by indexBoundQuals only */
- if (alength > 1)
- num_sa_scans *= alength;
}
else if (IsA(clause, NullTest))
{
@@ -6805,9 +6781,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
Selectivity btreeSelectivity;
/*
- * If the index is partial, AND the index predicate with the
- * index-bound quals to produce a more accurate idea of the number of
- * rows covered by the bound conditions.
+ * AND the index predicate with the index-bound quals to produce a
+ * more accurate idea of the number of rows covered by the bound
+ * conditions
*/
selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
@@ -6816,13 +6792,6 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
JOIN_INNER,
NULL);
numIndexTuples = btreeSelectivity * index->rel->tuples;
-
- /*
- * As in genericcostestimate(), we have to adjust for any
- * ScalarArrayOpExpr quals included in indexBoundQuals, and then round
- * to integer.
- */
- numIndexTuples = rint(numIndexTuples / num_sa_scans);
}
/*
@@ -6832,6 +6801,43 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
genericcostestimate(root, path, loop_count, &costs);
+ /*
+ * Now compensate for btree's ability to efficiently execute scans with
+ * SAOP clauses.
+ *
+ * btree automatically combines individual ScalarArrayOpExpr primitive
+ * index scans whenever the tuples covered by the next set of array keys
+ * are close to tuples covered by the current set. This makes the final
+ * number of descents particularly difficult to estimate. However, btree
+ * scans never visit any single leaf page more than once. That puts a
+ * natural floor under the worst case number of descents.
+ *
+ * It's particularly important that we not wildly overestimate the number
+ * of descents needed for a clause list with several SAOPs -- the costs
+ * really aren't multiplicative in the way genericcostestimate expects. In
+ * general, most distinct combinations of SAOP keys will tend to not find
+ * any matching tuples. Furthermore, btree scans search for the next set
+ * of array keys using the next tuple in line, and so won't even need a
+ * direct comparison to eliminate most non-matching sets of array keys.
+ *
+ * Clamp the number of descents to the estimated number of leaf page
+ * visits. This is still fairly pessimistic, but tends to result in more
+ * accurate costing of scans with several SAOP clauses -- especially when
+ * each array has more than a few elements. The cost of adding additional
+ * array constants to a low-order SAOP column should saturate past a
+ * certain point (except where selectivity estimates continue to shift).
+ *
+ * Also clamp the number of descents to 1/3 the number of index pages.
+ * This avoids implausibly high estimates with low selectivity paths,
+ * where scans frequently require no more than one or two descents.
+ */
+ if (costs.num_sa_scans > 1)
+ {
+ costs.num_sa_scans = Min(costs.num_sa_scans, costs.numIndexPages);
+ costs.num_sa_scans = Min(costs.num_sa_scans, index->pages / 3);
+ costs.num_sa_scans = Max(costs.num_sa_scans, 1);
+ }
+
/*
* Add a CPU-cost component to represent the costs of initial btree
* descent. We don't charge any I/O cost for touching upper btree levels,
@@ -6839,9 +6845,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* comparisons to descend a btree of N leaf tuples. We charge one
* cpu_operator_cost per comparison.
*
- * If there are ScalarArrayOpExprs, charge this once per SA scan. The
- * ones after the first one are not startup cost so far as the overall
- * plan is concerned, so add them only to "total" cost.
+ * If there are ScalarArrayOpExprs, charge this once per estimated
+ * primitive SA scan. The ones after the first one are not startup cost
+ * so far as the overall plan goes, so just add them to "total" cost.
*/
if (index->tuples > 1) /* avoid computing log(0) */
{
@@ -6858,7 +6864,8 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* in cases where only a single leaf page is expected to be visited. This
* cost is somewhat arbitrarily set at 50x cpu_operator_cost per page
* touched. The number of such pages is btree tree height plus one (ie,
- * we charge for the leaf page too). As above, charge once per SA scan.
+ * we charge for the leaf page too). As above, charge once per estimated
+ * primitive SA scan.
*/
descentCost = (index->tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER * cpu_operator_cost;
costs.indexStartupCost += descentCost;
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1149093a8..6a5068c72 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4005,6 +4005,19 @@ description | Waiting for a newly initialized WAL file to reach durable storage
</para>
</note>
+ <note>
+ <para>
+ Every time an index is searched, the index's
+ <structname>pg_stat_all_indexes</structname>.<structfield>idx_scan</structfield>
+ field is incremented. This usually happens once per index scan node
+ execution, but might take place several times during execution of a scan
+ that searches for multiple values together. Only queries that use certain
+ <acronym>SQL</acronym> constructs to search for rows matching any value
+ out of a list (or an array) of multiple scalar values are affected. See
+ <xref linkend="functions-comparisons"/> for details.
+ </para>
+ </note>
+
</sect2>
<sect2 id="monitoring-pg-statio-all-tables-view">
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index acfd9d1f4..84c068ae3 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1910,7 +1910,7 @@ SELECT count(*) FROM dupindexcols
(1 row)
--
--- Check ordering of =ANY indexqual results (bug in 9.2.0)
+-- Check that index scans with =ANY indexquals return rows in index order
--
explain (costs off)
SELECT unique1 FROM tenk1
@@ -1936,12 +1936,11 @@ explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
- QUERY PLAN
--------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------------------
Index Only Scan using tenk1_thous_tenthous on tenk1
- Index Cond: (thousand < 2)
- Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
-(3 rows)
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -1952,18 +1951,35 @@ ORDER BY thousand;
1 | 1001
(2 rows)
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Only Scan Backward using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ thousand | tenthous
+----------+----------
+ 1 | 1001
+ 0 | 3000
+(2 rows)
+
SET enable_indexonlyscan = OFF;
explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
- QUERY PLAN
---------------------------------------------------------------------------------------
- Sort
- Sort Key: thousand
- -> Index Scan using tenk1_thous_tenthous on tenk1
- Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
-(4 rows)
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -1974,6 +1990,25 @@ ORDER BY thousand;
1 | 1001
(2 rows)
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan Backward using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ thousand | tenthous
+----------+----------
+ 1 | 1001
+ 0 | 3000
+(2 rows)
+
RESET enable_indexonlyscan;
--
-- Check elimination of constant-NULL subexpressions
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index b95d30f65..25815634c 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -7795,10 +7795,9 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 >= any (array[1,5]);
Merge Cond: (j1.id1 = j2.id1)
Join Filter: (j2.id2 = j1.id2)
-> Index Scan using j1_id1_idx on j1
- -> Index Only Scan using j2_pkey on j2
+ -> Index Scan using j2_id1_idx on j2
Index Cond: (id1 >= ANY ('{1,5}'::integer[]))
- Filter: ((id1 % 1000) = 1)
-(7 rows)
+(6 rows)
select * from j1
inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index d49ce9f30..41b955a27 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -753,7 +753,7 @@ SELECT count(*) FROM dupindexcols
WHERE f1 BETWEEN 'WA' AND 'ZZZ' and id < 1000 and f1 ~<~ 'YX';
--
--- Check ordering of =ANY indexqual results (bug in 9.2.0)
+-- Check that index scans with =ANY indexquals return rows in index order
--
explain (costs off)
@@ -774,6 +774,15 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
SET enable_indexonlyscan = OFF;
explain (costs off)
@@ -785,6 +794,15 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
RESET enable_indexonlyscan;
--
--
2.42.0
On Sat, 21 Oct 2023 at 00:40, Peter Geoghegan <pg@bowt.ie> wrote:
On Sun, Oct 15, 2023 at 1:50 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v4, which applies cleanly on top of HEAD. This was needed
due to Alexandar Korotkov's commit e0b1ee17, "Skip checking of scan
keys required for directional scan in B-tree".Unfortunately I have more or less dealt with the conflicts on HEAD by
disabling the optimization from that commit, for the time being.Attached is v5, which deals with the conflict with the optimization
added by Alexandar Korotkov's commit e0b1ee17 sensibly: the
optimization is now only disabled in cases without array scan keys.
(It'd be very hard to make it work with array scan keys, since an
important principle for my patch is that we can change search-type
scan keys right in the middle of any _bt_readpage() call).
I'm planning on reviewing this patch tomorrow, but in an initial scan
through the patch I noticed there's little information about how the
array keys state machine works in this new design. Do you have a more
toplevel description of the full state machine used in the new design?
If not, I'll probably be able to discover my own understanding of the
mechanism used in the patch, but if there is a framework to build that
understanding on (rather than having to build it from scratch) that'd
be greatly appreciated.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I'm planning on reviewing this patch tomorrow, but in an initial scan
through the patch I noticed there's little information about how the
array keys state machine works in this new design. Do you have a more
toplevel description of the full state machine used in the new design?
This is an excellent question. You're entirely right: there isn't
enough information about the design of the state machine.
In v1 of the patch, from all the way back in July, the "state machine"
advanced in the hackiest way possible: via repeated "incremental"
advancement (using logic from the function that we call
_bt_advance_array_keys() on HEAD) in a loop -- we just kept doing that
until the function I'm now calling _bt_tuple_before_array_skeys()
eventually reported that the array keys were now sufficiently
advanced. v2 greatly improved matters by totally overhauling
_bt_advance_array_keys(): it was taught to use binary searches to
advance the array keys, with limited remaining use of "incremental"
array key advancement.
However, version 2 (and all later versions to date) have somewhat
wonky state machine transitions, in one important respect: calls to
the new _bt_advance_array_keys() won't always advance the array keys
to the maximum extent possible (possible while still getting correct
behavior, that is). There were still various complicated scenarios
involving multiple "required" array keys (SK_BT_REQFWD + SK_BT_REQBKWD
scan keys that use BTEqualStrategyNumber), where one single call to
_bt_advance_array_keys() would advance the array keys to a point that
was still < caller's tuple. AFAICT this didn't cause wrong answers to
queries (that would require failing to find a set of exactly matching
array keys where a matching set exists), but it was kludgey. It was
sloppy in roughly the same way as the approach in my v1 prototype was
sloppy (just to a lesser degree).
I should be able to post v6 later this week. My current plan is to
commit the other nbtree patch first (the backwards scan "boundary
cases" one from the ongoing CF) -- since I saw your review earlier
today. I think that you should probably wait for this v6 before
starting your review. The upcoming version will have simple
preconditions and postconditions for the function that advances the
array key state machine (the new _bt_advance_array_keys). These are
enforced by assertions at the start and end of the function. So the
rules for the state machine become crystal clear and fairly easy to
keep in your head (e.g., tuple must be >= required array keys on entry
and <= required array keys on exit, the array keys must always either
advance by one increment or be completely exhausted for the top-level
scan in the current scan direction).
Unsurprisingly, I found that adding and enforcing these invariants led
to a simpler and more general design within _bt_advance_array_keys.
That code is still the most complicated part of the patch, but it's
much less of a bag of tricks. Another reason for you to hold off for a
few more days.
--
Peter Geoghegan
On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:I'm planning on reviewing this patch tomorrow, but in an initial scan
through the patch I noticed there's little information about how the
array keys state machine works in this new design. Do you have a more
toplevel description of the full state machine used in the new design?This is an excellent question. You're entirely right: there isn't
enough information about the design of the state machine.I should be able to post v6 later this week. My current plan is to
commit the other nbtree patch first (the backwards scan "boundary
cases" one from the ongoing CF) -- since I saw your review earlier
today. I think that you should probably wait for this v6 before
starting your review.
Okay, thanks for the update, then I'll wait for v6 to be posted.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Tue, Nov 7, 2023 at 4:20 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan <pg@bowt.ie> wrote:
I should be able to post v6 later this week. My current plan is to
commit the other nbtree patch first (the backwards scan "boundary
cases" one from the ongoing CF) -- since I saw your review earlier
today. I think that you should probably wait for this v6 before
starting your review.Okay, thanks for the update, then I'll wait for v6 to be posted.
On second thought, I'll just post v6 now (there won't be conflicts
against the master branch once the other patch is committed anyway).
Highlights:
* Major simplifications to the array key state machine, already
described by my recent email.
* Added preprocessing of "redundant and contradictory" array elements
to _bt_preprocess_array_keys().
This makes the special preprocessing pass just for array keys
("preprocessing preprocessing") within _bt_preprocess_array_keys()
make this query into a no-op:
select * from tab where a in (180, 345) and a in (230, 300); -- contradictory
Similarly, it can make this query only attempt one single primitive
index scan for "230":
select * from tab where a in (180, 230) and a in (230, 300); -- has
redundancies, plus some individual elements contradict each other
This duplicates some of what _bt_preprocess_keys can do already. But
_bt_preprocess_keys can only do this stuff at the level of individual
array elements/primitive index scans. Whereas this works "one level
up", allowing preprocessing to see the full picture rather than just
seeing the start of one particular primitive index scan. It explicitly
works across array keys, saving repeat work inside
_bt_preprocess_keys. That could really add up with thousands of array
keys and/or multiple SAOPs. (Note that _bt_preprocess_array_keys
already does something like this, to deal with SAOP inequalities such
as "WHERE my_col >= any (array[1, 2])" -- it's a little surprising
that this obvious optimization wasn't part of the original nbtree SAOP
patch.)
This reminds me: you might want to try breaking the patch by coming up
with adversarial cases, Matthias. The patch needs to be able to deal
with absurdly large amounts of array keys reasonably well, because it
proposes to normalize passing those to the nbtree code. It's
especially important that the patch never takes too much time to do
something (e.g., binary searching through array keys) while holding a
buffer lock -- even with very silly adversarial queries.
So, for example, queries like this one (specifically designed to
stress the implementation) *need* to work reasonably well:
with a as (
select i from generate_series(0, 500000) i
)
select
count(*), thousand, tenthous
from
tenk1
where
thousand = any (array[(select array_agg(i) from a)]) and
tenthous = any (array[(select array_agg(i) from a)])
group by
thousand, tenthous
order by
thousand, tenthous;
(You can run this yourself after the regression tests finish, of course.)
This takes about 130ms on my machine, hardly any of which takes place
in the nbtree code with the patch (think tens of microseconds per
_bt_readpage call, at most) -- the plan is an index-only scan that
gets only 30 buffer hits. On the master branch, it's vastly slower --
1000025 buffer hits. The query as a whole takes about 3 seconds there.
If you have 3 or 4 SOAPs (with a composite index that has as many
columns) you can quite easily DOS the master branch, since the planner
makes a generic assumption that each of these SOAPs will have only 10
elements. The planner also thinks that with the patch applied, with
one important difference: it doesn't matter to nbtree. The cost of
scanning each index page should be practically independent of the
total size of each array, at least past a certain point. Similarly,
the maximum cost of an index scan should be approximately fixed: it
should be capped at the cost of a full index scan (with the added cost
of these relatively expensive quals still capped, still essentially
independent of array sizes past some point).
I notice that if I remove the "thousand = any (array[(select
array_agg(i) from a)]) and" line from the adversarial query, executing
the resulting query still get 30 buffer hits with the patch -- though
it only takes 90ms this time (it's faster for reasons that likely have
less than you'd think to do with nbtree overheads). This is just
another way of getting roughly the same full index scan. That's a
completely new way of thinking about nbtree SAOPs from a planner
perspective (also from a user's perspective, I suppose).
It's important that the planner's new optimistic assumptions about the
cost profile of SOAPS (that it can expect reasonable
performance/access patterns with wildly unreasonable/huge/arbitrarily
complicated SAOPs) always be met by nbtree -- no repeat index page
accesses, no holding a buffer lock for more than (say) a small
fraction of 1 millisecond (no matter the complexity of the query), and
possibly other things I haven't thought of yet.
If you end up finding a bug in this v6, it'll most likely be a case
where nbtree fails to live up to that. This project is as much about
robust/predictable performance as anything else -- nbtree needs to be
able to cope with practically anything. I suggest that your review
start by trying to break the patch along these lines.
--
Peter Geoghegan
Attachments:
v6-0001-Enhance-nbtree-ScalarArrayOp-execution.patchapplication/octet-stream; name=v6-0001-Enhance-nbtree-ScalarArrayOp-execution.patchDownload
From 1ae97d26aa5a1fb3e7dafc4160960bc144e4be9e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 17 Jun 2023 17:03:36 -0700
Subject: [PATCH v6] Enhance nbtree ScalarArrayOp execution.
Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively. This works by pushing additional context about the arrays
down into the nbtree index AM, as index quals. This information enabled
nbtree to execute multiple primitive index scans as part of an index
scan executor node that was treated as one continuous index scan.
The motivation behind this earlier work was enabling index-only scans
with ScalarArrayOpExpr clauses (SAOP quals are traditionally executed
via BitmapOr nodes, which is largely index-AM-agnostic, but always
requires heap access). The general idea of giving the index AM this
additional context can be pushed a lot further, though.
Teach nbtree SAOP index scans to advance array scan keys by applying
information about the physical characteristics of the index at runtime.
The array key state machine advances the current array keys using the
next index tuple in line to be scanned, at the point where the scan
reaches the end of index tuples matching its current array keys. We
dynamically decide whether to perform another primitive index scan (or
whether to stick with the ongoing leaf level traversal) using a set of
heuristics that aim to minimize repeat index descents. This approach
can be far more efficient: many cases that previously required thousands
of primitive index scans now require as few as one single primitive
index scan. All duplicative index page accesses are now avoided.
nbtree can now execute required and non-required array/SAOP scan keys in
the most efficient way possible. Naturally, only required SAOP keys
(i.e. those that can terminate the top-level scan) are capable of
triggering a new primitive index scan; non-required keys never affect
the scan's position. Consequently, index scans on a composite index
with (say) a high-order inequality key and a low-order SAOP key (which
nbtree will make into a non-required scan key) will now reliably output
rows in index order. The scan is always executed as one large index
scan under the hood, which is obviously the fastest way to do it, for
the usual reasons: it avoids useless repeat index page accesses across
successive primitive index scans. More importantly, nbtree's very
general approach removes any question of index scan nodes outputting
rows in an order that doesn't match the index. This enables the removal
of various special cases from the planner -- which in turn makes the
nbtree enhancements more effective and more widely applicable.
Bugfix commit 807a40c5 taught the planner to avoid generating unsafe
path keys: path keys on a multicolumn index path, with a SAOP clause on
any attribute beyond the first/most significant attribute. These cases
are now all safe, so we go back to generating path keys without regard
for the presence of SAOP clauses (just like with any other clause type).
Also undo changes from follow-up bugfix commit a4523c5a, which taught
the planner to produce alternative index paths without low-order
ScalarArrayOpExpr quals (paths where the quals appear as filter quals
instead). Now there is never any need to make a cost-based choice
between an index scan that can be trusted to return tuples in index
order (but has SAOP filter quals), and a more selective index scan that
can apply true SAOP index quals for one or more low-order index columns
(but cannot be trusted to produce tuples in index order).
Many of the queries sped up by the enhancements added by this commit
won't benefit much from avoiding repeat index page accesses. The most
compelling cases are those where query execution _completely_ avoids
many heap page accesses that filter quals would have otherwise required,
just to eliminate one or more non-matching rows from each heap page.
(In general, index scan filter quals always need "extra" heap accesses
to eliminate non-matching rows, since expression evaluation is only
deemed safe with visible rows. Whereas index quals never require inline
visibility checks; they can just eliminate non-matching rows up front.)
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
---
src/include/access/nbtree.h | 42 +-
src/backend/access/nbtree/nbtree.c | 63 +-
src/backend/access/nbtree/nbtsearch.c | 92 +-
src/backend/access/nbtree/nbtutils.c | 1472 +++++++++++++++++++-
src/backend/optimizer/path/indxpath.c | 86 +-
src/backend/utils/adt/selfuncs.c | 122 +-
doc/src/sgml/monitoring.sgml | 13 +
src/test/regress/expected/create_index.out | 61 +-
src/test/regress/expected/join.out | 5 +-
src/test/regress/sql/create_index.sql | 20 +-
10 files changed, 1700 insertions(+), 276 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7bfbf3086..566e1c15d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -965,7 +965,7 @@ typedef struct BTScanPosData
* moreLeft and moreRight track whether we think there may be matching
* index entries to the left and right of the current page, respectively.
* We can clear the appropriate one of these flags when _bt_checkkeys()
- * returns continuescan = false.
+ * sets BTReadPageState.continuescan = false.
*/
bool moreLeft;
bool moreRight;
@@ -1043,13 +1043,13 @@ typedef struct BTScanOpaqueData
/* workspace for SK_SEARCHARRAY support */
ScanKey arrayKeyData; /* modified copy of scan->keyData */
- bool arraysStarted; /* Started array keys, but have yet to "reach
- * past the end" of all arrays? */
int numArrayKeys; /* number of equality-type array keys (-1 if
* there are any unsatisfiable array keys) */
- int arrayKeyCount; /* count indicating number of array scan keys
- * processed */
+ bool needPrimScan; /* Perform another primitive scan? */
BTArrayKeyInfo *arrayKeys; /* info about each equality-type array key */
+ FmgrInfo *orderProcs; /* ORDER procs for equality constraint keys */
+ int numPrimScans; /* Running tally of # primitive index scans
+ * (used to coordinate parallel workers) */
MemoryContext arrayContext; /* scan-lifespan context for array data */
/* info about killed items if any (killedItems is NULL if never used) */
@@ -1083,6 +1083,29 @@ typedef struct BTScanOpaqueData
typedef BTScanOpaqueData *BTScanOpaque;
+/*
+ * _bt_readpage state used across _bt_checkkeys calls for a page
+ *
+ * When _bt_readpage is called during a forward scan that has one or more
+ * equality-type SK_SEARCHARRAY scan keys, it has an extra responsibility: to
+ * set up information about the final tuple from the page. This must happen
+ * before the first call to _bt_checkkeys. _bt_checkkeys uses the final tuple
+ * to manage advancement of the scan's array keys more efficiently.
+ */
+typedef struct BTReadPageState
+{
+ /* Input parameters, set by _bt_readpage */
+ ScanDirection dir; /* current scan direction */
+ IndexTuple finaltup; /* final tuple (high key for forward scans) */
+
+ /* Output parameters, set by _bt_checkkeys */
+ bool continuescan; /* Terminate ongoing (primitive) index scan? */
+
+ /* Private _bt_checkkeys-managed state */
+ bool finaltupchecked; /* final tuple checked against current
+ * SK_SEARCHARRAY array keys? */
+} BTReadPageState;
+
/*
* We use some private sk_flags bits in preprocessed scan keys. We're allowed
* to use bits 16-31 (see skey.h). The uppermost bits are copied from the
@@ -1090,6 +1113,7 @@ typedef BTScanOpaqueData *BTScanOpaque;
*/
#define SK_BT_REQFWD 0x00010000 /* required to continue forward scan */
#define SK_BT_REQBKWD 0x00020000 /* required to continue backward scan */
+#define SK_BT_RDDNARRAY 0x00040000 /* redundant in array preprocessing */
#define SK_BT_INDOPTION_SHIFT 24 /* must clear the above bits */
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
@@ -1160,7 +1184,7 @@ extern bool btcanreturn(Relation index, int attno);
extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno);
extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
extern void _bt_parallel_done(IndexScanDesc scan);
-extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
+extern void _bt_parallel_next_primitive_scan(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
@@ -1253,12 +1277,12 @@ extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan,
+extern bool _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool finaltup,
bool requiredMatchedByPrecheck);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index a88b36a58..6328a8a63 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -48,8 +48,8 @@
* BTPARALLEL_IDLE indicates that no backend is currently advancing the scan
* to a new page; some process can start doing that.
*
- * BTPARALLEL_DONE indicates that the scan is complete (including error exit).
- * We reach this state once for every distinct combination of array keys.
+ * BTPARALLEL_DONE indicates that the primitive index scan is complete
+ * (including error exit). Reached once per primitive index scan.
*/
typedef enum
{
@@ -69,8 +69,8 @@ typedef struct BTParallelScanDescData
BTPS_State btps_pageStatus; /* indicates whether next page is
* available for scan. see above for
* possible states of parallel scan. */
- int btps_arrayKeyCount; /* count indicating number of array scan
- * keys processed by parallel scan */
+ int btps_numPrimScans; /* count indicating number of primitive
+ * index scans (used with array keys) */
slock_t btps_mutex; /* protects above variables */
ConditionVariable btps_cv; /* used to synchronize parallel scan */
} BTParallelScanDescData;
@@ -275,8 +275,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
/* If we have a tuple, return it ... */
if (res)
break;
- /* ... otherwise see if we have more array keys to deal with */
- } while (so->numArrayKeys && _bt_advance_array_keys(scan, dir));
+ /* ... otherwise see if we need another primitive index scan */
+ } while (so->numArrayKeys && _bt_array_keys_remain(scan, dir));
return res;
}
@@ -333,8 +333,8 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
ntids++;
}
}
- /* Now see if we have more array keys to deal with */
- } while (so->numArrayKeys && _bt_advance_array_keys(scan, ForwardScanDirection));
+ /* Now see if we need another primitive index scan */
+ } while (so->numArrayKeys && _bt_array_keys_remain(scan, ForwardScanDirection));
return ntids;
}
@@ -364,9 +364,10 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->keyData = NULL;
so->arrayKeyData = NULL; /* assume no array keys for now */
- so->arraysStarted = false;
so->numArrayKeys = 0;
+ so->needPrimScan = false;
so->arrayKeys = NULL;
+ so->orderProcs = NULL;
so->arrayContext = NULL;
so->killedItems = NULL; /* until needed */
@@ -406,7 +407,8 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
}
so->markItemIndex = -1;
- so->arrayKeyCount = 0;
+ so->needPrimScan = false;
+ so->numPrimScans = 0;
so->firstPage = false;
BTScanPosUnpinIfPinned(so->markPos);
BTScanPosInvalidate(so->markPos);
@@ -588,7 +590,7 @@ btinitparallelscan(void *target)
SpinLockInit(&bt_target->btps_mutex);
bt_target->btps_scanPage = InvalidBlockNumber;
bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- bt_target->btps_arrayKeyCount = 0;
+ bt_target->btps_numPrimScans = 0;
ConditionVariableInit(&bt_target->btps_cv);
}
@@ -614,7 +616,7 @@ btparallelrescan(IndexScanDesc scan)
SpinLockAcquire(&btscan->btps_mutex);
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- btscan->btps_arrayKeyCount = 0;
+ btscan->btps_numPrimScans = 0;
SpinLockRelease(&btscan->btps_mutex);
}
@@ -625,7 +627,11 @@ btparallelrescan(IndexScanDesc scan)
*
* The return value is true if we successfully seized the scan and false
* if we did not. The latter case occurs if no pages remain for the current
- * set of scankeys.
+ * primitive index scan.
+ *
+ * When array scan keys are in use, each worker process independently advances
+ * its array keys. It's crucial that each worker process never be allowed to
+ * scan a page from before the current scan position.
*
* If the return value is true, *pageno returns the next or current page
* of the scan (depending on the scan direction). An invalid block number
@@ -656,16 +662,17 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
SpinLockAcquire(&btscan->btps_mutex);
pageStatus = btscan->btps_pageStatus;
- if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+ if (so->numPrimScans < btscan->btps_numPrimScans)
{
- /* Parallel scan has already advanced to a new set of scankeys. */
+ /* Top-level scan already moved on to next primitive index scan */
status = false;
}
else if (pageStatus == BTPARALLEL_DONE)
{
/*
- * We're done with this set of scankeys. This may be the end, or
- * there could be more sets to try.
+ * We're done with this primitive index scan. This might have
+ * been the final primitive index scan required, or the top-level
+ * index scan might require additional primitive scans.
*/
status = false;
}
@@ -697,9 +704,12 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
void
_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
{
+ BTScanOpaque so PG_USED_FOR_ASSERTS_ONLY = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
+ Assert(!so->needPrimScan);
+
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
@@ -733,12 +743,11 @@ _bt_parallel_done(IndexScanDesc scan)
parallel_scan->ps_offset);
/*
- * Mark the parallel scan as done for this combination of scan keys,
- * unless some other process already did so. See also
- * _bt_advance_array_keys.
+ * Mark the primitive index scan as done, unless some other process
+ * already did so. See also _bt_array_keys_remain.
*/
SpinLockAcquire(&btscan->btps_mutex);
- if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
+ if (so->numPrimScans >= btscan->btps_numPrimScans &&
btscan->btps_pageStatus != BTPARALLEL_DONE)
{
btscan->btps_pageStatus = BTPARALLEL_DONE;
@@ -752,14 +761,14 @@ _bt_parallel_done(IndexScanDesc scan)
}
/*
- * _bt_parallel_advance_array_keys() -- Advances the parallel scan for array
- * keys.
+ * _bt_parallel_next_primitive_scan() -- Advances parallel primitive scan
+ * counter when array keys are in use.
*
- * Updates the count of array keys processed for both local and parallel
+ * Updates the count of primitive index scans for both local and parallel
* scans.
*/
void
-_bt_parallel_advance_array_keys(IndexScanDesc scan)
+_bt_parallel_next_primitive_scan(IndexScanDesc scan)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
@@ -768,13 +777,13 @@ _bt_parallel_advance_array_keys(IndexScanDesc scan)
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
- so->arrayKeyCount++;
+ so->numPrimScans++;
SpinLockAcquire(&btscan->btps_mutex);
if (btscan->btps_pageStatus == BTPARALLEL_DONE)
{
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- btscan->btps_arrayKeyCount++;
+ btscan->btps_numPrimScans++;
}
SpinLockRelease(&btscan->btps_mutex);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index efc5284e5..b2addd714 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -893,7 +893,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
if (!so->qual_ok)
{
- /* Notify any other workers that we're done with this scan key. */
+ /* Notify any other workers that this primitive scan is done */
_bt_parallel_done(scan);
return false;
}
@@ -952,6 +952,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* one we use --- by definition, they are either redundant or
* contradictory.
*
+ * When SK_SEARCHARRAY keys are in use, _bt_tuple_before_array_keys is
+ * used to avoid prematurely stopping the scan when an array equality qual
+ * has its array keys advanced.
+ *
* Any regular (not SK_SEARCHNULL) key implies a NOT NULL qualifier.
* If the index stores nulls at the end of the index we'll be starting
* from, and we have no boundary key for the column (which means the key
@@ -1537,9 +1541,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
BTPageOpaque opaque;
OffsetNumber minoff;
OffsetNumber maxoff;
+ BTReadPageState pstate;
int itemIndex;
- bool continuescan;
- int indnatts;
bool requiredMatchedByPrecheck;
/*
@@ -1560,8 +1563,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
}
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ pstate.dir = dir;
+ pstate.finaltup = NULL;
+ pstate.continuescan = true; /* default assumption */
+ pstate.finaltupchecked = false;
+
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
@@ -1609,9 +1615,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* the last item on the page would give a more precise answer.
*
* We skip this for the first page in the scan to evade the possible
- * slowdown of the point queries.
+ * slowdown of the point queries. Do the same with scans with array keys,
+ * since that makes the optimization unsafe (our search-type scan keys can
+ * change during any call to _bt_checkkeys whenever array keys are used).
*/
- if (!so->firstPage && minoff < maxoff)
+ if (!so->firstPage && minoff < maxoff && !so->numArrayKeys)
{
ItemId iid;
IndexTuple itup;
@@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* set flag to true if all required keys are satisfied and false
* otherwise.
*/
- (void) _bt_checkkeys(scan, itup, indnatts, dir,
- &requiredMatchedByPrecheck, false);
+ _bt_checkkeys(scan, &pstate, itup, false, false);
+ requiredMatchedByPrecheck = pstate.continuescan;
+ pstate.continuescan = true; /* reset */
}
else
{
@@ -1636,6 +1645,14 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (ScanDirectionIsForward(dir))
{
+ /* SK_SEARCHARRAY forward scans must provide high key up front */
+ if (so->numArrayKeys && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+
+ pstate.finaltup = (IndexTuple) PageGetItem(page, iid);
+ }
+
/* load items[] in ascending order */
itemIndex = 0;
@@ -1659,8 +1676,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, iid);
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan, requiredMatchedByPrecheck);
+ passes_quals = _bt_checkkeys(scan, &pstate, itup, false,
+ requiredMatchedByPrecheck);
/*
* If the result of prechecking required keys was true, then in
@@ -1668,8 +1685,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* result is the same.
*/
Assert(!requiredMatchedByPrecheck ||
- passes_quals == _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan, false));
+ passes_quals == _bt_checkkeys(scan, &pstate, itup, false,
+ false));
if (passes_quals)
{
/* tuple passes all scan key conditions */
@@ -1703,7 +1720,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
}
}
/* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
+ if (!pstate.continuescan)
break;
offnum = OffsetNumberNext(offnum);
@@ -1720,17 +1737,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* only appear on non-pivot tuples on the right sibling page are
* common.
*/
- if (continuescan && !P_RIGHTMOST(opaque))
+ if (pstate.continuescan && !P_RIGHTMOST(opaque))
{
- ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
+ IndexTuple itup;
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan, false);
+ if (pstate.finaltup)
+ itup = pstate.finaltup;
+ else
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+
+ itup = (IndexTuple) PageGetItem(page, iid);
+ }
+
+ _bt_checkkeys(scan, &pstate, itup, true, false);
}
- if (!continuescan)
+ if (!pstate.continuescan)
so->currPos.moreRight = false;
Assert(itemIndex <= MaxTIDsPerBTreePage);
@@ -1740,6 +1763,14 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
}
else
{
+ /* SK_SEARCHARRAY backward scans must provide final tuple up front */
+ if (so->numArrayKeys && minoff < maxoff)
+ {
+ ItemId iid = PageGetItemId(page, minoff);
+
+ pstate.finaltup = (IndexTuple) PageGetItem(page, iid);
+ }
+
/* load items[] in descending order */
itemIndex = MaxTIDsPerBTreePage;
@@ -1751,6 +1782,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
IndexTuple itup;
bool tuple_alive;
bool passes_quals;
+ bool finaltup = (offnum == minoff);
/*
* If the scan specifies not to return killed tuples, then we
@@ -1761,12 +1793,18 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* tuple on the page, we do check the index keys, to prevent
* uselessly advancing to the page to the left. This is similar
* to the high key optimization used by forward scans.
+ *
+ * Separately, _bt_checkkeys actually requires that we call it
+ * with the final non-pivot tuple from the page, if there's one
+ * (final processed tuple, or first tuple in offset number terms).
+ * We must indicate which particular tuple comes last, too.
*/
if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
{
Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
+ if (!finaltup)
{
+ Assert(offnum > minoff);
offnum = OffsetNumberPrev(offnum);
continue;
}
@@ -1778,8 +1816,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, iid);
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan, requiredMatchedByPrecheck);
+ passes_quals = _bt_checkkeys(scan, &pstate, itup, finaltup,
+ requiredMatchedByPrecheck);
/*
* If the result of prechecking required keys was true, then in
@@ -1787,8 +1825,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* result is the same.
*/
Assert(!requiredMatchedByPrecheck ||
- passes_quals == _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan, false));
+ passes_quals == _bt_checkkeys(scan, &pstate, itup,
+ finaltup, false));
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions */
@@ -1827,7 +1865,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
}
}
}
- if (!continuescan)
+ if (!pstate.continuescan)
{
/* there can't be any more matches, so stop */
so->currPos.moreLeft = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1510b97fb..8318e6250 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -33,7 +33,7 @@
typedef struct BTSortArrayContext
{
- FmgrInfo flinfo;
+ FmgrInfo *orderproc;
Oid collation;
bool reverse;
} BTSortArrayContext;
@@ -41,15 +41,41 @@ typedef struct BTSortArrayContext
static Datum _bt_find_extreme_element(IndexScanDesc scan, ScanKey skey,
StrategyNumber strat,
Datum *elems, int nelems);
+static void _bt_sort_cmp_func_setup(IndexScanDesc scan, ScanKey skey);
static int _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
bool reverse,
Datum *elems, int nelems);
+static int _bt_merge_arrays(IndexScanDesc scan, ScanKey skey, bool reverse,
+ Datum *elems_orig, int nelems_orig,
+ Datum *elems_next, int nelems_next);
static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
+static inline int32 _bt_compare_array_skey(FmgrInfo *orderproc,
+ Datum tupdatum, bool tupnull,
+ Datum arrdatum, ScanKey cur);
+static int _bt_binsrch_array_skey(FmgrInfo *orderproc,
+ bool cur_elem_start, ScanDirection dir,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *final_result);
+static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_tuple_before_array_skeys(IndexScanDesc scan,
+ BTReadPageState *pstate,
+ IndexTuple tuple);
+static bool _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool skrequiredtrigger);
+static void _bt_preprocess_keys_leafbuf(IndexScanDesc scan);
+#ifdef USE_ASSERT_CHECKING
+static bool _bt_verify_array_scankeys(IndexScanDesc scan);
+#endif
static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
ScanKey leftarg, ScanKey rightarg,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
+static bool _bt_check_compare(ScanDirection dir, BTScanOpaque so,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ bool *continuescan, bool *skrequiredtrigger,
+ bool requiredMatchedByPrecheck);
static bool _bt_check_rowcompare(ScanKey skey,
IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
ScanDirection dir, bool *continuescan);
@@ -198,13 +224,48 @@ _bt_freestack(BTStack stack)
* If there are any SK_SEARCHARRAY scan keys, deconstruct the array(s) and
* set up BTArrayKeyInfo info for each one that is an equality-type key.
* Prepare modified scan keys in so->arrayKeyData, which will hold the current
- * array elements during each primitive indexscan operation. For inequality
- * array keys, it's sufficient to find the extreme element value and replace
- * the whole array with that scalar value.
+ * array elements.
+ *
+ * _bt_preprocess_keys treats each primitive scan as an independent piece of
+ * work. That structure pushes the responsibility for preprocessing that must
+ * work "across array keys" onto us. This division of labor makes sense once
+ * you consider that we're typically called no more than once per btrescan,
+ * whereas _bt_preprocess_keys is always called once per primitive index scan.
+ *
+ * Currently we perform two kinds of preprocessing to deal with redundancies.
+ * For inequality array keys, it's sufficient to find the extreme element
+ * value and replace the whole array with that scalar value. This eliminates
+ * all but one array key as redundant. Similarly, we are capable of "merging
+ * together" multiple equality array keys from two or more input scan keys
+ * into a single output scan key that contains only the intersecting array
+ * elements. This can eliminate many redundant array elements, as well as
+ * eliminating whole array scan keys as redundant.
+ *
+ * Note: _bt_start_array_keys actually sets up the cur_elem counters later on,
+ * once the scan direction is known.
*
* Note: the reason we need so->arrayKeyData, rather than just scribbling
* on scan->keyData, is that callers are permitted to call btrescan without
* supplying a new set of scankey data.
+ *
+ * Note: _bt_preprocess_keys is responsible for creating the so->keyData scan
+ * keys used by _bt_checkkeys. Index scans that don't use equality array keys
+ * will have _bt_preprocess_keys treat scan->keyData as input and so->keyData
+ * as output. Scans that use equality array keys have _bt_preprocess_keys
+ * treat so->arrayKeyData (which is our output) as their input, while (as per
+ * usual) outputting so->keyData for _bt_checkkeys. This function adds an
+ * additional layer of indirection that allows _bt_preprocess_keys to more or
+ * less avoid dealing with SK_SEARCHARRAY as a special case.
+ *
+ * Note: _bt_preprocess_keys_leafbuf works by updating already-processed
+ * output keys (so->keyData) in-place. It cannot eliminate redundant or
+ * contradictory scan keys. This necessitates having _bt_preprocess_keys
+ * understand that it is unsafe to eliminate "redundant" SK_SEARCHARRAY
+ * equality scan keys on the basis of what is actually just the current array
+ * key values -- it must conservatively assume that such a scan key might no
+ * longer be redundant after the next _bt_preprocess_keys_leafbuf call.
+ * Ideally we'd be able to deal with that by eliminating a subset of truly
+ * redundant array keys up-front, but it doesn't seem worth the trouble.
*/
void
_bt_preprocess_array_keys(IndexScanDesc scan)
@@ -212,7 +273,9 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
int numberOfKeys = scan->numberOfKeys;
int16 *indoption = scan->indexRelation->rd_indoption;
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(scan->indexRelation);
int numArrayKeys;
+ int lastEqualityArrayAtt = -1;
ScanKey cur;
int i;
MemoryContext oldContext;
@@ -265,6 +328,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
/* Allocate space for per-array data in the workspace context */
so->arrayKeys = (BTArrayKeyInfo *) palloc0(numArrayKeys * sizeof(BTArrayKeyInfo));
+ so->orderProcs = (FmgrInfo *) palloc0(nkeyatts * sizeof(FmgrInfo));
/* Now process each array key */
numArrayKeys = 0;
@@ -281,6 +345,16 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
int j;
cur = &so->arrayKeyData[i];
+
+ /*
+ * Attributes with equality-type scan keys (including but not limited
+ * to array scan keys) will need a 3-way comparison function. Set
+ * that up now. (Avoids repeating work for the same attribute.)
+ */
+ if (cur->sk_strategy == BTEqualStrategyNumber &&
+ !OidIsValid(so->orderProcs[cur->sk_attno - 1].fn_oid))
+ _bt_sort_cmp_func_setup(scan, cur);
+
if (!(cur->sk_flags & SK_SEARCHARRAY))
continue;
@@ -357,6 +431,46 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
(indoption[cur->sk_attno - 1] & INDOPTION_DESC) != 0,
elem_values, num_nonnulls);
+ /*
+ * If this scan key is semantically equivalent to a previous equality
+ * operator array scan key, merge the two arrays together to eliminate
+ * redundant non-intersecting elements (and redundant whole scan keys)
+ */
+ if (lastEqualityArrayAtt == cur->sk_attno)
+ {
+ BTArrayKeyInfo *prev = &so->arrayKeys[numArrayKeys - 1];
+
+ Assert(so->arrayKeyData[prev->scan_key].sk_func.fn_oid ==
+ cur->sk_func.fn_oid);
+ Assert(so->arrayKeyData[prev->scan_key].sk_subtype ==
+ cur->sk_subtype);
+
+ /* We could pfree(elem_values) after, but not worth the cycles */
+ num_elems = _bt_merge_arrays(scan, cur,
+ (indoption[cur->sk_attno - 1] & INDOPTION_DESC) != 0,
+ prev->elem_values, prev->num_elems,
+ elem_values, num_elems);
+
+ /*
+ * If there are no intersecting elements left from merging this
+ * array into the previous array on the same attribute, the scan
+ * qual is unsatisfiable
+ */
+ if (num_elems == 0)
+ {
+ numArrayKeys = -1;
+ break;
+ }
+
+ /*
+ * Lower the number of elements from the previous array, and mark
+ * this scan key/array as redundant for every primitive index scan
+ */
+ prev->num_elems = num_elems;
+ cur->sk_flags |= SK_BT_RDDNARRAY;
+ continue;
+ }
+
/*
* And set up the BTArrayKeyInfo data.
*/
@@ -364,6 +478,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
so->arrayKeys[numArrayKeys].num_elems = num_elems;
so->arrayKeys[numArrayKeys].elem_values = elem_values;
numArrayKeys++;
+ lastEqualityArrayAtt = cur->sk_attno;
}
so->numArrayKeys = numArrayKeys;
@@ -437,26 +552,20 @@ _bt_find_extreme_element(IndexScanDesc scan, ScanKey skey,
}
/*
- * _bt_sort_array_elements() -- sort and de-dup array elements
+ * Look up the appropriate comparison function in the opfamily.
*
- * The array elements are sorted in-place, and the new number of elements
- * after duplicate removal is returned.
- *
- * scan and skey identify the index column, whose opfamily determines the
- * comparison semantics. If reverse is true, we sort in descending order.
+ * Note: it's possible that this would fail, if the opfamily is incomplete,
+ * but it seems quite unlikely that an opfamily would omit non-cross-type
+ * support functions for any datatype that it supports at all.
*/
-static int
-_bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
- bool reverse,
- Datum *elems, int nelems)
+static void
+_bt_sort_cmp_func_setup(IndexScanDesc scan, ScanKey skey)
{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
Oid elemtype;
RegProcedure cmp_proc;
- BTSortArrayContext cxt;
-
- if (nelems <= 1)
- return nelems; /* no work to do */
+ FmgrInfo *orderproc = &so->orderProcs[skey->sk_attno - 1];
/*
* Determine the nominal datatype of the array elements. We have to
@@ -471,12 +580,10 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
* Look up the appropriate comparison function in the opfamily.
*
* Note: it's possible that this would fail, if the opfamily is
- * incomplete, but it seems quite unlikely that an opfamily would omit
- * non-cross-type support functions for any datatype that it supports at
- * all.
+ * incomplete.
*/
cmp_proc = get_opfamily_proc(rel->rd_opfamily[skey->sk_attno - 1],
- elemtype,
+ rel->rd_opcintype[skey->sk_attno - 1],
elemtype,
BTORDER_PROC);
if (!RegProcedureIsValid(cmp_proc))
@@ -484,8 +591,32 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
BTORDER_PROC, elemtype, elemtype,
rel->rd_opfamily[skey->sk_attno - 1]);
+ /* Save in orderproc entry for attribute */
+ fmgr_info_cxt(cmp_proc, orderproc, so->arrayContext);
+}
+
+/*
+ * _bt_sort_array_elements() -- sort and de-dup array elements
+ *
+ * The array elements are sorted in-place, and the new number of elements
+ * after duplicate removal is returned.
+ *
+ * scan and skey identify the index column, whose opfamily determines the
+ * comparison semantics. If reverse is true, we sort in descending order.
+ */
+static int
+_bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
+ bool reverse,
+ Datum *elems, int nelems)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSortArrayContext cxt;
+
+ if (nelems <= 1)
+ return nelems; /* no work to do */
+
/* Sort the array elements */
- fmgr_info(cmp_proc, &cxt.flinfo);
+ cxt.orderproc = &so->orderProcs[skey->sk_attno - 1];
cxt.collation = skey->sk_collation;
cxt.reverse = reverse;
qsort_arg(elems, nelems, sizeof(Datum),
@@ -496,6 +627,48 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
_bt_compare_array_elements, &cxt);
}
+/*
+ * _bt_merge_arrays() -- merge together duplicate array keys
+ *
+ * Both scan key's have array elements that have already been sorted and
+ * deduplicated.
+ */
+static int
+_bt_merge_arrays(IndexScanDesc scan, ScanKey skey, bool reverse,
+ Datum *elems_orig, int nelems_orig,
+ Datum *elems_next, int nelems_next)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSortArrayContext cxt;
+ Datum *merged = palloc(sizeof(Datum) * nelems_orig);
+ int merged_nelems = 0;
+
+ /*
+ * Incrementally copy the original array into a temp buffer, skipping over
+ * any items that are missing from the "next" array
+ */
+ cxt.orderproc = &so->orderProcs[skey->sk_attno - 1];
+ cxt.collation = skey->sk_collation;
+ cxt.reverse = reverse;
+ for (int i = 0; i < nelems_orig; i++)
+ {
+ Datum *elem = elems_orig + i;
+
+ if (bsearch_arg(elem, elems_next, nelems_next, sizeof(Datum),
+ _bt_compare_array_elements, &cxt))
+ merged[merged_nelems++] = *elem;
+ }
+
+ /*
+ * Overwrite the original array with temp buffer so that we're only left
+ * with intersecting array elements
+ */
+ memcpy(elems_orig, merged, merged_nelems * sizeof(Datum));
+ pfree(merged);
+
+ return merged_nelems;
+}
+
/*
* qsort_arg comparator for sorting array elements
*/
@@ -507,7 +680,7 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
BTSortArrayContext *cxt = (BTSortArrayContext *) arg;
int32 compare;
- compare = DatumGetInt32(FunctionCall2Coll(&cxt->flinfo,
+ compare = DatumGetInt32(FunctionCall2Coll(cxt->orderproc,
cxt->collation,
da, db));
if (cxt->reverse)
@@ -515,6 +688,161 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
return compare;
}
+/*
+ * Comparator uses to search for the next array element when array keys need
+ * to be advanced via one or more binary searches
+ *
+ * This routine returns:
+ * <0 if tupdatum < arrdatum;
+ * 0 if tupdatum == arrdatum;
+ * >0 if tupdatum > arrdatum.
+ *
+ * This is essentially the same interface as _bt_compare: both functions
+ * compare the value that they're searching for to a binary search pivot.
+ * However, unlike _bt_compare, this function's "tuple argument" comes first,
+ * while its "array/scankey argument" comes second.
+*/
+static inline int32
+_bt_compare_array_skey(FmgrInfo *orderproc,
+ Datum tupdatum, bool tupnull,
+ Datum arrdatum, ScanKey cur)
+{
+ int32 result = 0;
+
+ Assert(cur->sk_strategy == BTEqualStrategyNumber);
+ Assert((cur->sk_flags & SK_ROW_HEADER) == 0);
+
+ if (cur->sk_flags & SK_ISNULL) /* array/scan key is NULL */
+ {
+ if (tupnull)
+ result = 0; /* NULL "=" NULL */
+ else if (cur->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NULL "<" NOT_NULL */
+ else
+ result = -1; /* NULL ">" NOT_NULL */
+ }
+ else if (tupnull) /* array/scan key is NOT_NULL and tuple item
+ * is NULL */
+ {
+ if (cur->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NOT_NULL ">" NULL */
+ else
+ result = 1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * Like _bt_compare, we need to be careful of cross-type comparisons,
+ * so the left value has to be the value that came from an index
+ * tuple. (Array scan keys cannot be cross-type, but other required
+ * scan keys that use an equal operator can be.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(orderproc, cur->sk_collation,
+ tupdatum, arrdatum));
+
+ /*
+ * Unlike _bt_compare, we flip the sign when column is a DESC column
+ * (and *not* when column is ASC). This matches the approach taken by
+ * _bt_check_rowcompare, which performs similar three-way comparisons.
+ */
+ if (cur->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ return result;
+}
+
+/*
+ * _bt_binsrch_array_skey() -- Binary search for next matching array key
+ *
+ * cur_elem_start indicates if the binary search should begin at the array's
+ * current element (or have the current element as an upper bound if it's a
+ * backward scan). This (and information about the scan's direction) allows
+ * searches against required scan key arrays to reuse earlier search bounds as
+ * an optimization.
+ *
+ * Returns an index to the first array element >= caller's tupdatum argument.
+ * Also sets *final_result to whatever _bt_compare_array_skey returned when we
+ * directly compared the returned array element to caller's tupdatum argument.
+ */
+static int
+_bt_binsrch_array_skey(FmgrInfo *orderproc,
+ bool cur_elem_start, ScanDirection dir,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *final_result)
+{
+ int low_elem,
+ mid_elem,
+ high_elem,
+ result = 0;
+
+ Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(cur->sk_strategy == BTEqualStrategyNumber);
+ Assert(!cur_elem_start ||
+ array->elem_values[array->cur_elem] == cur->sk_argument);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ if (cur_elem_start)
+ low_elem = array->cur_elem;
+ else
+ low_elem = 0;
+ high_elem = array->num_elems - 1;
+ }
+ else
+ {
+ low_elem = 0;
+ if (cur_elem_start)
+ high_elem = array->cur_elem;
+ else
+ high_elem = array->num_elems - 1;
+ }
+ mid_elem = -1;
+
+ while (high_elem > low_elem)
+ {
+ Datum arrdatum;
+
+ mid_elem = low_elem + ((high_elem - low_elem) / 2);
+ arrdatum = array->elem_values[mid_elem];
+
+ result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
+ arrdatum, cur);
+
+ if (result == 0)
+ {
+ /*
+ * Each array was deduplicated during initial preprocessing, so
+ * it's safe to quit as soon as we see an equal array element.
+ * This often saves an extra comparison or two...
+ */
+ low_elem = mid_elem;
+ break;
+ }
+
+ if (result > 0)
+ low_elem = mid_elem + 1;
+ else
+ high_elem = mid_elem;
+ }
+
+ /*
+ * ...but our caller also cares about how its searched-for tuple datum
+ * compares to the array element we'll return. We set *final_result with
+ * the result of that comparison specifically.
+ *
+ * Avoid setting *final_result to the wrong comparison's result.
+ */
+ if (low_elem != mid_elem)
+ result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
+ array->elem_values[low_elem], cur);
+
+ *final_result = result;
+
+ return low_elem;
+}
+
/*
* _bt_start_array_keys() -- Initialize array keys at start of a scan
*
@@ -539,30 +867,35 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
curArrayKey->cur_elem = 0;
skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
}
-
- so->arraysStarted = true;
}
/*
- * _bt_advance_array_keys() -- Advance to next set of array elements
+ * _bt_advance_array_keys_increment() -- Advance to next set of array elements
+ *
+ * Advances the array keys by a single increment in the current scan
+ * direction. When there are multiple array keys this can roll over from the
+ * lowest order array to higher order arrays.
*
* Returns true if there is another set of values to consider, false if not.
* On true result, the scankeys are initialized with the next set of values.
+ * On false result, the scankeys stay the same, and the array keys are not
+ * advanced (every array is still at its final element for scan direction).
*/
-bool
-_bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
+static bool
+_bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool found = false;
- int i;
+
+ Assert(!so->needPrimScan);
/*
* We must advance the last array key most quickly, since it will
* correspond to the lowest-order index column among the available
- * qualifications. This is necessary to ensure correct ordering of output
- * when there are multiple array keys.
+ * qualifications. Rolling over like this is necessary to ensure correct
+ * ordering of output when there are multiple array keys.
*/
- for (i = so->numArrayKeys - 1; i >= 0; i--)
+ for (int i = so->numArrayKeys - 1; i >= 0; i--)
{
BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key];
@@ -596,19 +929,31 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
break;
}
- /* advance parallel scan */
- if (scan->parallel_scan != NULL)
- _bt_parallel_advance_array_keys(scan);
+ if (found)
+ return true;
/*
- * When no new array keys were found, the scan is "past the end" of the
- * array keys. _bt_start_array_keys can still "restart" the array keys if
- * a rescan is required.
+ * Don't allow the entire set of array keys to roll over: restore the
+ * array keys to the state they were in before we were called.
+ *
+ * This ensures that the array keys only ratchet forward (or backwards in
+ * the case of backward scans). Our "so->arrayKeyData" scan keys should
+ * always match the current "so->keyData" search-type scan keys (except
+ * for a brief moment during array key advancement).
*/
- if (!found)
- so->arraysStarted = false;
+ for (int i = 0; i < so->numArrayKeys; i++)
+ {
+ BTArrayKeyInfo *rollarray = &so->arrayKeys[i];
+ ScanKey skey = &so->arrayKeyData[rollarray->scan_key];
- return found;
+ if (ScanDirectionIsBackward(dir))
+ rollarray->cur_elem = 0;
+ else
+ rollarray->cur_elem = rollarray->num_elems - 1;
+ skey->sk_argument = rollarray->elem_values[rollarray->cur_elem];
+ }
+
+ return false;
}
/*
@@ -622,6 +967,8 @@ _bt_mark_array_keys(IndexScanDesc scan)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
int i;
+ Assert(_bt_verify_array_scankeys(scan));
+
for (i = 0; i < so->numArrayKeys; i++)
{
BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
@@ -661,20 +1008,691 @@ _bt_restore_array_keys(IndexScanDesc scan)
* If we changed any keys, we must redo _bt_preprocess_keys. That might
* sound like overkill, but in cases with multiple keys per index column
* it seems necessary to do the full set of pushups.
- *
- * Also do this whenever the scan's set of array keys "wrapped around" at
- * the end of the last primitive index scan. There won't have been a call
- * to _bt_preprocess_keys from some other place following wrap around, so
- * we do it for ourselves.
*/
- if (changed || !so->arraysStarted)
+ if (changed)
{
_bt_preprocess_keys(scan);
/* The mark should have been set on a consistent set of keys... */
Assert(so->qual_ok);
}
+
+ Assert(_bt_verify_array_scankeys(scan));
}
+/*
+ * Routine to determine if a continuescan=false tuple (set that way by an
+ * initial call to _bt_check_compare) must advance the scan's array keys.
+ * Only call here when _bt_check_compare already set continuescan=false.
+ *
+ * Returns true when caller passes a tuple that is < the current set of array
+ * keys for the most significant non-equal column/scan key (or > for backwards
+ * scans). This means that it cannot possibly be time to advance the array
+ * keys just yet. _bt_checkkeys caller should suppress its _bt_check_compare
+ * call, and return -- the tuple is treated as not satisfying our indexquals.
+ *
+ * Returns false when caller's tuple is >= the current array keys (or <=, in
+ * the case of backwards scans). This means that it is now time for our
+ * caller to advance the array keys (unless caller broke the rules by not
+ * checking with _bt_check_compare before calling here).
+ *
+ * Note: advancing the array keys may be required when every attribute value
+ * from caller's tuple is equal to corresponding scan key/array datums. See
+ * function header comments at the start of _bt_advance_array_keys for more.
+ */
+static bool
+_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ ScanDirection dir = pstate->dir;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ bool tuple_before_array_keys = false;
+ ScanKey cur;
+ int ntupatts = BTreeTupleGetNAtts(tuple, rel),
+ ikey;
+
+ Assert(so->numArrayKeys > 0);
+ Assert(so->numberOfKeys > 0);
+ Assert(!so->needPrimScan);
+
+ for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++)
+ {
+ int attnum = cur->sk_attno;
+ FmgrInfo *orderproc;
+ Datum tupdatum;
+ bool tupnull,
+ skrequired;
+ int32 result;
+
+ /*
+ * We only deal with equality strategy scan keys. We leave handling
+ * of inequalities up to _bt_check_compare.
+ */
+ if (cur->sk_strategy != BTEqualStrategyNumber)
+ continue;
+
+ /*
+ * Determine if this scan key is required.
+ *
+ * Equality strategy scan keys are either required in both directions
+ * or neither direction, so the current scan direction doesn't need to
+ * be tested here.
+ */
+ skrequired = (cur->sk_flags & SK_BT_REQFWD);
+ Assert(!skrequired || (cur->sk_flags & SK_BT_REQBKWD));
+
+ /*
+ * Unlike _bt_advance_array_keys, we never deal with any non-required
+ * array keys. Cases where skrequiredtrigger is set to false by
+ * _bt_check_compare should never call here. We are only called after
+ * _bt_check_compare provisionally indicated that the scan should be
+ * terminated due to a _required_ scan key not being satisfied.
+ *
+ * We expect _bt_check_compare to notice and report required scan keys
+ * before non-required ones. _bt_advance_array_keys might still have
+ * to advance non-required array keys in passing for a tuple that we
+ * were called for, but it doesn't need advanced notice of that from
+ * us.
+ */
+ if (!skrequired)
+ break;
+
+ if (attnum > ntupatts)
+ {
+ /*
+ * When we reach a high key's truncated attribute, assume that the
+ * tuple attribute's value is >= the scan's search-type scan keys
+ */
+ break;
+ }
+
+ tupdatum = index_getattr(tuple, attnum, itupdesc, &tupnull);
+
+ orderproc = &so->orderProcs[attnum - 1];
+ result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
+ cur->sk_argument, cur);
+
+ if (result != 0)
+ {
+ if (ScanDirectionIsForward(dir))
+ tuple_before_array_keys = result < 0;
+ else
+ tuple_before_array_keys = result > 0;
+
+ break;
+ }
+ }
+
+ return tuple_before_array_keys;
+}
+
+/*
+ * _bt_array_keys_remain() -- Start another primitive index scan?
+ *
+ * Returns true if _bt_checkkeys determined that another primitive index scan
+ * must take place by calling _bt_first. Otherwise returns false, indicating
+ * that caller's top-level scan is now past the point where further matching
+ * index tuples can be found (for the current scan direction).
+ *
+ * Only call here during scans with one or more equality type array scan keys.
+ * All other scans should just call _bt_first once, no matter what.
+ *
+ * Top-level index scans executed via multiple primitive index scans must not
+ * fail to output index tuples in the usual order for the index -- just like
+ * any other index scan would. The state machine that manages the scan's
+ * array keys must only start primitive index scans when they cover key space
+ * strictly greater than the key space for tuples that the scan has already
+ * returned (or strictly less in the backwards scan case). Otherwise the scan
+ * could output the same index tuples more than once, or in the wrong order.
+ *
+ * This is managed by limiting the cases that can trigger new primitive index
+ * scans to those involving required array scan keys and/or other required
+ * scan keys that use the equality strategy. In particular, the state machine
+ * must not allow high order required scan keys using an inequality strategy
+ * (which are only required in one scan direction) to directly trigger a new
+ * primitive index scan that advances low order non-required array scan keys.
+ * For example, a query such as "SELECT thousand, tenthous FROM tenk1 WHERE
+ * thousand < 2 AND tenthous IN (1001,3000) ORDER BY thousand" whose execution
+ * involves a scan of an index on "(thousand, tenthous)" must perform no more
+ * than a single primitive index scan. Otherwise we risk outputting tuples in
+ * the wrong order. Array key values for the non-required scan key on the
+ * "tenthous" column must not dictate top-level scan order. Primitive index
+ * scans mustn't scan tuples already scanned by some earlier primitive scan.
+ *
+ * In fact, nbtree makes a stronger guarantee than is strictly necessary here:
+ * it guarantees that the top-level scan won't repeat any leaf page reads.
+ * (Actually, that can still happen when the scan is repositioned, or the scan
+ * direction changes -- but that's just as true with other types of scans.)
+ */
+bool
+_bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ Assert(so->numArrayKeys);
+
+ /*
+ * Array keys are advanced within _bt_checkkeys when the scan reaches the
+ * leaf level (more precisely, they're advanced when the scan reaches the
+ * end of each distinct set of array elements). This process avoids
+ * repeat access to leaf pages (across multiple primitive index scans) by
+ * opportunistically advancing the scan's array keys when it allows the
+ * primitive index scan to find nearby matching tuples (or to eliminate
+ * array keys with no matching tuples from further consideration).
+ *
+ * _bt_checkkeys sets a simple flag variable that we check here. This
+ * tells us if we need to perform another primitive index scan for the
+ * now-current array keys or not. We'll unset the flag once again to
+ * acknowledge having started a new primitive scan (or we'll see that it
+ * isn't set and end the top-level scan right away).
+ *
+ * We cannot rely on _bt_first always reaching _bt_checkkeys here. There
+ * are various scenarios where that won't happen. For example, if the
+ * index is completely empty, then _bt_first won't get as far as calling
+ * _bt_readpage/_bt_checkkeys.
+ *
+ * We also don't expect _bt_checkkeys to be reached when searching for a
+ * non-existent value that happens to be higher than any existing value in
+ * the index. No _bt_checkkeys are expected when _bt_readpage reads the
+ * rightmost page during such a scan -- even a _bt_checkkeys call against
+ * the high key won't happen. There is an analogous issue for backwards
+ * scans that search for a value lower than all existing index tuples.
+ *
+ * We don't actually require special handling for these cases -- we don't
+ * need to be explicitly instructed to _not_ perform another primitive
+ * index scan. This is correct for all of the cases we've listed so far,
+ * which all involve primitive index scans that access pages "near the
+ * boundaries of the key space" (the leftmost page, the rightmost page, or
+ * an imaginary empty leaf root page). If _bt_checkkeys cannot be reached
+ * by a primitive index scan for one set of array keys, it follows that it
+ * also won't be reached for any later set of array keys.
+ *
+ * There is one exception: the case where _bt_first's _bt_preprocess_keys
+ * call determined that the scan's input scan keys can never be satisfied.
+ * That might be true for one set of array keys, but not the next set.
+ */
+ if (!so->qual_ok)
+ {
+ /*
+ * Defensively check for interrupts -- the scan's next call to
+ * _bt_first won't be able to do so if the next set of keys also turn
+ * out to be unsatisfiable
+ */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Can't use _bt_advance_array_keys so use incremental advancement */
+ so->needPrimScan = false;
+ if (_bt_advance_array_keys_increment(scan, dir))
+ return true;
+ }
+
+ /* Time for another primitive index scan? */
+ if (so->needPrimScan)
+ {
+ /* Have our caller call _bt_first once more */
+ so->needPrimScan = false;
+ if (scan->parallel_scan != NULL)
+ _bt_parallel_next_primitive_scan(scan);
+
+ return true;
+ }
+
+ if (scan->parallel_scan != NULL)
+ _bt_parallel_done(scan);
+
+ /*
+ * No more primitive index scans. Terminate the top-level scan.
+ */
+ return false;
+}
+
+/*
+ * _bt_advance_array_keys() -- Advance array elements using a tuple
+ *
+ * Returns true if all required equality-type scan keys (in particular, those
+ * that are array keys) now have exact matching values to those from tuple.
+ * Returns false when the tuple isn't an exact match in this sense.
+ *
+ * Sets pstate.continuescan for caller when we return false. When we return
+ * true it's up to caller to call _bt_check_compare to recheck the tuple. The
+ * second call should be allowed to set pstate.continuescan=false without
+ * further intervention, since tuple must be <= the array keys after we're
+ * called (actually, that guarantee applies to all required equality-type scan
+ * keys, and does not apply to non-required array keys).
+ *
+ * When called with skrequiredtrigger=true, the call only expects to have to
+ * deal with non-required equality array keys. The rules are a little
+ * different during these calls. We'll always set pstate.continuescan=true,
+ * since (by definition) a non-required scan key never terminates the scan.
+ *
+ * If we reach the end of all of the required array keys for the current scan
+ * direction, we will effectively end the top-level index scan.
+ *
+ * This function will always advance the array keys by at least one increment
+ * (except when it ends the top-level index scan having reached a tuple beyond
+ * the scan's final array key, and except during !skrequiredtrigger calls).
+ *
+ * _bt_tuple_before_array_skeys is responsible for determining if the current
+ * place in the scan is >= the current array keys. Calling here before that
+ * point will prematurely advance the array keys, leading to wrong query
+ * results (though this precondition is checked here via an assertion).
+ *
+ * We're responsible for ensuring that caller's tuple is <= current/newly
+ * advanced required array keys once we return (this postcondition is also
+ * checked via another assertion). We try to find an exact match, but failing
+ * that we'll advance the array keys to whatever set of keys comes next in the
+ * key space (among the keys that we actually have). In general, the scan's
+ * array keys can only ever "ratchet forwards", progressing in lock step with
+ * the scan.
+ *
+ * (The invariants are the same for backwards scans, except that the operators
+ * are flipped: just replace the precondition's >= operator with a <=, and the
+ * postcondition's <= operator with with a >=. In other words, just swap the
+ * precondition with the postcondition.)
+ *
+ * Note that we may sometimes need to advance the array keys in spite of the
+ * existing array keys already being an exact match for every corresponding
+ * value from caller's tuple. We fall back on "incrementally" advancing the
+ * array keys in these cases, which all involve non-array scan keys. For
+ * example, with a composite index on (a, b) and a qual "WHERE a IN (3,5) AND
+ * b < 42", we'll be called for both "a" keys (i.e. keys 3 and 5) when the
+ * scan reaches tuples where "b >= 42". Even though "a" array keys continue
+ * to have exact matches for tuples "b >= 42" (for both array key groupings),
+ * we will still advance the array for "a" via our fallback on incremental
+ * advancement each time we're called. The first time we're called (when the
+ * scan reaches a tuple >= "(3, 42)"), we advance the array key (from 3 to 5).
+ * This gives our caller the option of starting a new primitive index scan
+ * that quickly locates the start of tuples > "(5, -inf)". The second time
+ * we're called (when the scan reaches a tuple >= "(5, 42)"), we incrementally
+ * advance the keys a second time. This second call ends the top-level scan.
+ *
+ * Note also that we deal with all required equality-type scan keys here; it's
+ * not limited to array scan keys. We need to handle non-array equality cases
+ * here because they're equality constraints for the scan, in the same way
+ * that array scan keys are.
+ */
+static bool
+_bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool skrequiredtrigger)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ ScanDirection dir = pstate->dir;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ ScanKey cur;
+ int ikey,
+ arrayidx = 0,
+ ntupatts = BTreeTupleGetNAtts(tuple, rel);
+ bool arrays_advanced = false,
+ arrays_exhausted,
+ beyond_end_advance = false,
+ all_eqtype_sk_equal = true,
+ all_required_eqtype_sk_equal PG_USED_FOR_ASSERTS_ONLY = true;
+
+ /*
+ * Must only be called when tuple is >= current required array keys
+ * (except during backwards scans, when it must be <= the array keys)
+ */
+ Assert(_bt_verify_array_scankeys(scan));
+ Assert(!skrequiredtrigger ||
+ !_bt_tuple_before_array_skeys(scan, pstate, tuple));
+
+ /*
+ * Try to advance array keys via a series of binary searches.
+ *
+ * Loop iterates through the current scankeys (so->keyData, which were
+ * output by _bt_preprocess_keys earlier) and then sets input scan keys
+ * (so->arrayKeyData scan keys) to new array values.
+ */
+ for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++)
+ {
+ BTArrayKeyInfo *array = NULL;
+ ScanKey skeyarray = NULL;
+ FmgrInfo *orderproc;
+ int attnum = cur->sk_attno,
+ set_elem = 0;
+ Datum tupdatum;
+ bool skrequired,
+ tupnull;
+ int32 result;
+
+ /*
+ * We only deal with equality strategy scan keys. We leave handling
+ * of inequalities up to _bt_check_compare.
+ */
+ if (cur->sk_strategy != BTEqualStrategyNumber)
+ continue;
+
+ /*
+ * Determine if this scan key is required.
+ *
+ * Equality strategy scan keys are either required in both directions
+ * or neither direction, so the current scan direction doesn't need to
+ * be tested here.
+ */
+ skrequired = (cur->sk_flags & SK_BT_REQFWD);
+ Assert(!skrequired || (cur->sk_flags & SK_BT_REQBKWD));
+
+ /*
+ * Set up ORDER 3-way comparison function and array state
+ */
+ orderproc = &so->orderProcs[attnum - 1];
+ if (cur->sk_flags & SK_SEARCHARRAY)
+ {
+ Assert(arrayidx < so->numArrayKeys);
+ array = &so->arrayKeys[arrayidx++];
+ skeyarray = &so->arrayKeyData[array->scan_key];
+ Assert(skeyarray->sk_attno == attnum);
+ }
+
+ /*
+ * Optimization: Skip over non-required scan keys when we know that
+ * they can't have changed (because _bt_check_compare triggered this
+ * call due to encountering an unsatisified non-required array qual)
+ */
+ if (skrequired && !skrequiredtrigger)
+ {
+ Assert(!beyond_end_advance && !arrays_advanced);
+
+ continue;
+ }
+
+ /*
+ * Here we perform steps for all array scan keys after a required
+ * array scan key whose binary search triggered "beyond end of array
+ * element" array advancement due to encountering a tuple attribute
+ * value > the closest matching array key (or < for backwards scans).
+ *
+ * We help to make sure that the array keys are ultimately advanced
+ * such that caller's tuple is < final array keys (or > final keys).
+ * We're behind the scan right now, but we'll fully "catch up" once
+ * outside the loop (we'll be immediately ahead of this tuple). See
+ * below for a detailed explanation.
+ *
+ * NB: We must do this for all arrays -- not just required arrays.
+ * Otherwise the final incremental array advancement step (that takes
+ * place just outside the loop) won't "carry" in the way we expect.
+ */
+ if (beyond_end_advance)
+ {
+ int final_elem_dir;
+
+ Assert(skrequiredtrigger);
+ Assert(!all_eqtype_sk_equal && !all_required_eqtype_sk_equal);
+
+ if (ScanDirectionIsBackward(dir) || !array)
+ final_elem_dir = 0;
+ else
+ final_elem_dir = array->num_elems - 1;
+
+ if (array && array->cur_elem != final_elem_dir)
+ {
+ array->cur_elem = final_elem_dir;
+ skeyarray->sk_argument = array->elem_values[final_elem_dir];
+ arrays_advanced = true;
+ }
+
+ continue;
+ }
+
+ /*
+ * Here we perform steps for any required scan keys after the first
+ * required scan key whose tuple attribute was < the closest matching
+ * array key when we dealt with it (or > for backwards scans).
+ *
+ * This earlier required array key already puts us ahead of caller's
+ * tuple in the key space (for the current scan direction). We must
+ * make sure that subsequent lower-order array keys do not put us too
+ * far ahead (ahead of tuples that have yet to be seen by our caller).
+ * For example, when a tuple "(a, b) = (42, 5)" advances the array
+ * keys on "a" from 40 to 45, we must also set "b" to whatever the
+ * first array element for "b" is. It would be wrong to allow "b" to
+ * be set to a value from the tuple, since the value is actually from
+ * a different part of the key space.
+ *
+ * Also perform the same steps with truncated high key attributes.
+ * You can think of this as a "binary search" for the element closest
+ * to the value -inf. This is another case where we have to avoid
+ * getting too far ahead of the scan.
+ */
+ if (!all_eqtype_sk_equal || attnum > ntupatts)
+ {
+ int first_elem_dir;
+
+ Assert((skrequiredtrigger && arrays_advanced) ||
+ attnum > ntupatts);
+ Assert(!beyond_end_advance);
+
+ if (ScanDirectionIsForward(dir) || !array)
+ first_elem_dir = 0;
+ else
+ first_elem_dir = array->num_elems - 1;
+
+ if (array && array->cur_elem != first_elem_dir)
+ {
+ array->cur_elem = first_elem_dir;
+ skeyarray->sk_argument = array->elem_values[first_elem_dir];
+ arrays_advanced = true;
+ }
+
+ continue;
+ }
+
+ /*
+ * Search in scankey's array for the corresponding tuple attribute
+ * value from caller's tuple
+ */
+ tupdatum = index_getattr(tuple, attnum, itupdesc, &tupnull);
+
+ if (!array)
+ {
+ if (!skrequired)
+ continue;
+
+ /*
+ * This is a required non-array equality strategy scan key, which
+ * we'll treat as a degenerate single value array
+ */
+ result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
+ cur->sk_argument, cur);
+ }
+ else
+ {
+ /* Determine if search bounds are reusable (optimization) */
+ bool cur_elem_start = (skrequired && !arrays_advanced);
+
+ /*
+ * Binary search for closest match that's available from the array
+ */
+ set_elem = _bt_binsrch_array_skey(orderproc, cur_elem_start, dir,
+ tupdatum, tupnull, array, cur,
+ &result);
+ }
+
+ /* Consider advancing array keys */
+ Assert(!array || (set_elem >= 0 && set_elem < array->num_elems));
+ if (array && array->cur_elem != set_elem)
+ {
+ array->cur_elem = set_elem;
+ skeyarray->sk_argument = array->elem_values[set_elem];
+ arrays_advanced = true;
+
+ /*
+ * We shouldn't have to advance a required array when called due
+ * to _bt_check_compare determining that a non-required array
+ * needs to be advanced. We expect _bt_check_compare to notice
+ * and report required scan keys before non-required ones.
+ */
+ Assert(skrequiredtrigger || !skrequired);
+ }
+
+ /*
+ * Consider "beyond end of array element" array advancement.
+ *
+ * When the tuple attribute value is > the closest matching array key
+ * (or < in the backwards scan case), we need to ratchet the array
+ * forward (backward) by one position, so that the array is set to a
+ * value < the tuple attribute value instead (or to a value > tuple's
+ * value).
+ *
+ * This process has to work for all of the arrays, not just this one:
+ * it must "carry" to higher-order arrays when the set_elem that we
+ * just used for this array happens to have been the final element
+ * (final for the current scan direction). That's why we don't handle
+ * this issue by modifying this array's set_elem (that won't "carry").
+ *
+ * Our approach is to set each subsequent lower-order array to its
+ * final element. We'll then advance the array keys incrementally,
+ * just outside the loop. That way earlier/higher order arrays
+ * (arrays before _this_ array) can advance as and when required.
+ *
+ * The array keys advance a little like the way that an mileage gauge
+ * advances. Imagine a mechanical display that rolls over from 999 to
+ * 000 every time we drive our car another 1,000 miles. Each decimal
+ * digit behaves a little like an array from the array state machine
+ * implemented by this function.
+ *
+ * Suppose we have 3 array keys a, b, and c. Each "digit"/array has
+ * 10 distinct elements that happen to match across each array: values
+ * 0 through to 9. Caller's tuple "(a, b, c) = (3, 7.9, 2)" might
+ * initially have its "b" array advanced up to the value 7 (7 being
+ * the closest match the "b" array has), and its "c" array advanced up
+ * to 9. The incremental advancement step (outside the loop) will
+ * then finish the process by "advancing" (actually, rolling over) the
+ * array on "c" to the value 0, which would immediately carry over to
+ * "b", which will then advance to the value 8 ("rounding up" from 7).
+ * Under this scheme, the array keys only ever ratchet forward, and
+ * array key advancement by us takes place as infrequently as possible
+ * (see also: this function's postcondition assertions, below).
+ *
+ * Incremental advancement can also carry all the way past the most
+ * significant array, exhausting all of the scan's array keys in one
+ * step. Suppose, for example, that a later call here passes a tuple
+ * "(a, b, c) = (9, 9.9, 4)". Once again we can't find an exact match
+ * for "b", so we'll set beyond_end_advance. This time, incremental
+ * advancement rolls over all the way past "a", the most significant
+ * array. _bt_advance_array_keys_increment will return false when
+ * this happens, indicating that all array keys are now exhausted.
+ * This triggers the end of the top-level index scan below.
+ */
+ Assert(!beyond_end_advance);
+ if (skrequired &&
+ ((ScanDirectionIsForward(dir) && result > 0) ||
+ (ScanDirectionIsBackward(dir) && result < 0)))
+ beyond_end_advance = true;
+
+ /*
+ * Also track whether all attributes from the tuple are equal to the
+ * array keys that we'll be advancing to (or to existing array keys
+ * that didn't need to be advanced)
+ */
+ if (result != 0)
+ {
+ all_eqtype_sk_equal = false;
+ if (skrequired)
+ all_required_eqtype_sk_equal = false;
+
+ /* Just skip if triggered by a non-required scan key */
+ if (!skrequiredtrigger)
+ break;
+ }
+ }
+
+ /*
+ * Consider if we need to advance the array keys incrementally to finish
+ * off "beyond end of array element" array advancement.
+ *
+ * Also fall back on incremental advancement in cases where we couldn't
+ * advance the array keys any other way. See function header comments for
+ * an example of this, where inequality-type scan keys alone drive array
+ * key advancement. (We don't directly deal with inequality type scan
+ * keys here, but cases that use the fallback must involve inequalities.)
+ */
+ arrays_exhausted = false;
+ if ((beyond_end_advance || !arrays_advanced) && skrequiredtrigger)
+ {
+ /* Fallback case must have all-equal equality type scan keys */
+ Assert(beyond_end_advance || all_required_eqtype_sk_equal);
+
+ if (!_bt_advance_array_keys_increment(scan, dir))
+ arrays_exhausted = true;
+ else
+ arrays_advanced = true;
+
+ /*
+ * The newly advanced array keys won't be equal anymore, so remember
+ * that in order to avoid a second _bt_check_compare call for tuple
+ */
+ all_eqtype_sk_equal = all_required_eqtype_sk_equal = false;
+ }
+
+ Assert(arrays_exhausted || arrays_advanced || !skrequiredtrigger);
+
+ /*
+ * If we haven't yet exhausted all required array scan keys, allow the
+ * ongoing primitive index scan to continue
+ */
+ pstate->continuescan = !arrays_exhausted;
+
+ /* Cannot set continuescan=false when called for non-required array */
+ Assert(pstate->continuescan || skrequiredtrigger);
+
+ if (arrays_advanced)
+ {
+ /*
+ * We advanced the array keys, and so must perform a targeted form of
+ * in-place preprocessing of the scan's search-type scan keys.
+ *
+ * If we missed this final step then any call to _bt_check_compare
+ * would use stale array keys until such time as _bt_preprocess_keys
+ * was once again called by _bt_first. But it's a good idea to do
+ * this even when there won't be another primitive index scan.
+ */
+ _bt_preprocess_keys_leafbuf(scan);
+
+ /*
+ * If any required array keys were advanced, be prepared to recheck
+ * the final tuple against the new array keys (as an optimization)
+ */
+ if (skrequiredtrigger)
+ pstate->finaltupchecked = false;
+ }
+
+ /*
+ * Postcondition assertions.
+ *
+ * Tuple must now be <= current/newly advanced required array keys. Same
+ * goes for other required equality type scan keys, which are "degenerate
+ * single value arrays" for our purposes. (As usual the rule is the same
+ * for backwards scans, but the operator is flipped: tuple must be >= new
+ * array keys.)
+ *
+ * We're stricter than that in cases where the tuple was already equal to
+ * the previous array keys when we were called: tuple must now be < the
+ * new array keys (or > the array keys). This is a consequence of the
+ * fallback on incremental advancement used to indirectly handle cases
+ * where an inequality triggers array key advancement. (See function
+ * header comments for an example of this.)
+ *
+ * Our caller decides when to start primitive index scans based in part on
+ * the current array keys. It always needs to see a precise array-wise
+ * picture of the scan's progress. If we ever advanced the array keys by
+ * less than the exact maximum safe amount, our caller might go on to make
+ * subtly wrong decisions about when to quit the ongoing primitive scan.
+ * (These assertions won't reliably detect every case where the array keys
+ * haven't advance by the expected/maximum amount, but they come close.)
+ */
+ Assert(_bt_verify_array_scankeys(scan));
+ Assert(_bt_tuple_before_array_skeys(scan, pstate, tuple) ==
+ (!all_required_eqtype_sk_equal && !arrays_exhausted));
+
+ /* All-equal required equality keys shouldn't be from before this call */
+ Assert(!all_required_eqtype_sk_equal || !skrequiredtrigger ||
+ arrays_advanced || arrays_exhausted);
+
+ return all_eqtype_sk_equal && pstate->continuescan;
+}
/*
* _bt_preprocess_keys() -- Preprocess scan keys
@@ -749,6 +1767,21 @@ _bt_restore_array_keys(IndexScanDesc scan)
* Again, missing cross-type operators might cause us to fail to prove the
* quals contradictory when they really are, but the scan will work correctly.
*
+ * Index scans with array keys need to be able to advance each array's keys
+ * and make them the current search-type scan keys without calling here. They
+ * expect to be able to call _bt_preprocess_keys_leafbuf instead (a stripped
+ * down version of this function that's specialized to array key index scans).
+ * We need to be careful about that case here when we determine redundancy;
+ * equality quals must not be eliminated as redundant on the basis of array
+ * input keys that might change before another call here takes place.
+ *
+ * Note, however, that the presence of an array scan key doesn't affect how we
+ * determine if index quals are contradictory. Contradictory qual scans move
+ * on to the next primitive index scan right away, by incrementing the scan's
+ * array keys once control reaches _bt_array_keys_remain. There won't ever be
+ * a call to _bt_preprocess_keys_leafbuf before the next call here, so there
+ * is nothing for us to break.
+ *
* Row comparison keys are currently also treated without any smarts:
* we just transfer them into the preprocessed array without any
* editorialization. We can treat them the same as an ordinary inequality
@@ -895,8 +1928,11 @@ _bt_preprocess_keys(IndexScanDesc scan)
so->qual_ok = false;
return;
}
- /* else discard the redundant non-equality key */
- xform[j] = NULL;
+ else if (!(eq->sk_flags & SK_SEARCHARRAY))
+ {
+ /* else discard the redundant non-equality key */
+ xform[j] = NULL;
+ }
}
/* else, cannot determine redundancy, keep both keys */
}
@@ -986,6 +2022,22 @@ _bt_preprocess_keys(IndexScanDesc scan)
continue;
}
+ /*
+ * Is this an array scan key that _bt_preprocess_array_keys merged
+ * with some earlier array key during its initial preprocessing pass?
+ */
+ if (cur->sk_flags & SK_BT_RDDNARRAY)
+ {
+ /*
+ * key is redundant for this primitive index scan (and will be
+ * redundant during all subsequent primitive index scans)
+ */
+ Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(j == (BTEqualStrategyNumber - 1));
+ Assert(so->numArrayKeys > 0);
+ continue;
+ }
+
/* have we seen one of these before? */
if (xform[j] == NULL)
{
@@ -999,7 +2051,26 @@ _bt_preprocess_keys(IndexScanDesc scan)
&test_result))
{
if (test_result)
- xform[j] = cur;
+ {
+ if (j == (BTEqualStrategyNumber - 1) &&
+ ((xform[j]->sk_flags & SK_SEARCHARRAY) ||
+ (cur->sk_flags & SK_SEARCHARRAY)))
+ {
+ /*
+ * Must never replace an = array operator ourselves,
+ * nor can we ever fail to remember an = array
+ * operator. _bt_preprocess_keys_leafbuf expects
+ * this.
+ */
+ ScanKey outkey = &outkeys[new_numberOfKeys++];
+
+ memcpy(outkey, cur, sizeof(ScanKeyData));
+ if (numberOfEqualCols == attno - 1)
+ _bt_mark_scankey_required(outkey);
+ }
+ else
+ xform[j] = cur;
+ }
else if (j == (BTEqualStrategyNumber - 1))
{
/* key == a && key == b, but a != b */
@@ -1027,6 +2098,96 @@ _bt_preprocess_keys(IndexScanDesc scan)
so->numberOfKeys = new_numberOfKeys;
}
+/*
+ * _bt_preprocess_keys_leafbuf() -- Preprocess array scan keys only
+ *
+ * Stripped down version of _bt_preprocess_keys that can be called with a
+ * buffer lock held. Reuses much of the work performed during the previous
+ * _bt_preprocess_keys call.
+ *
+ * This function just transfers newly advanced array keys that were set in
+ * "so->arrayKeyData" to corresponding "so->keyData" search-type scan keys.
+ * It does not independently detect redunant or contradictory scan keys.
+ */
+static void
+_bt_preprocess_keys_leafbuf(IndexScanDesc scan)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey cur;
+ int ikey,
+ arrayidx = 0;
+
+ Assert(so->qual_ok);
+
+ for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++)
+ {
+ BTArrayKeyInfo *array;
+ ScanKey skeyarray;
+
+ Assert((cur->sk_flags & SK_BT_RDDNARRAY) == 0);
+
+ /* Just update equality array scan keys */
+ if (cur->sk_strategy != BTEqualStrategyNumber ||
+ !(cur->sk_flags & SK_SEARCHARRAY))
+ continue;
+
+ array = &so->arrayKeys[arrayidx++];
+ skeyarray = &so->arrayKeyData[array->scan_key];
+
+ /* Update the scan key's argument */
+ Assert(cur->sk_attno == skeyarray->sk_attno);
+ cur->sk_argument = skeyarray->sk_argument;
+ }
+
+ Assert(arrayidx == so->numArrayKeys);
+}
+
+/*
+ * Verify that the scan's "so->arrayKeyData" scan keys are in agreement with
+ * the current "so->keyData" search-type scan keys. Used within assertions.
+ */
+#ifdef USE_ASSERT_CHECKING
+static bool
+_bt_verify_array_scankeys(IndexScanDesc scan)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey cur;
+ int ikey,
+ arrayidx = 0;
+
+ if (!so->qual_ok)
+ return false;
+
+ for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++)
+ {
+ BTArrayKeyInfo *array;
+ ScanKey skeyarray;
+
+ if (cur->sk_strategy != BTEqualStrategyNumber ||
+ !(cur->sk_flags & SK_SEARCHARRAY))
+ continue;
+
+ array = &so->arrayKeys[arrayidx++];
+ skeyarray = &so->arrayKeyData[array->scan_key];
+
+ /* Verify so->arrayKeyData input scan key has expected sk_argument */
+ if (skeyarray->sk_argument != array->elem_values[array->cur_elem])
+ return false;
+
+ /* Verify so->arrayKeyData input scan key agrees with output key */
+ if (cur->sk_attno != skeyarray->sk_attno)
+ return false;
+ if (cur->sk_argument != skeyarray->sk_argument)
+ return false;
+ }
+
+ if (arrayidx != so->numArrayKeys)
+ return false;
+
+ return true;
+}
+#endif
+
/*
* Compare two scankey values using a specified operator.
*
@@ -1360,41 +2521,198 @@ _bt_mark_scankey_required(ScanKey skey)
*
* Return true if so, false if not. If the tuple fails to pass the qual,
* we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
+ * this tuple, and set pstate.continuescan accordingly. See comments for
* _bt_preprocess_keys(), above, about how this is done.
*
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
+ * Forward scan callers can pass a high key tuple in the hopes of having us
+ * set pstate.continuescan to false, and avoiding an unnecessary visit to the
+ * page to the right.
+ *
+ * Forwards scan callers with equality type array scan keys are obligated to
+ * set up page state in a way that makes it possible for us to check the final
+ * tuple (the high key for a forward scan) early, before we've expended too
+ * much effort on comparing tuples that cannot possibly be matches for any set
+ * of array keys. This is just an optimization.
+ *
+ * Advances the current set of array keys for SK_SEARCHARRAY scans where
+ * appropriate. These callers are required to initialize the page level high
+ * key in pstate before the first call here for the page (when the scan
+ * direction is forwards). Note that we rely on _bt_readpage calling here in
+ * page offset number order (for its scan direction). Any other order will
+ * lead to inconsistent array key state.
*
* scan: index scan descriptor (containing a search-type scankey)
+ * pstate: Page level input and output parameters
* tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
+ * finaltup: Is tuple the final one we'll be called with for this page?
* requiredMatchedByPrecheck: indicates that scan keys required for
* direction scan are already matched
*/
bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan,
+_bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool finaltup,
bool requiredMatchedByPrecheck)
{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
+ TupleDesc tupdesc = RelationGetDescr(scan->indexRelation);
+ int natts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool res;
+ bool skrequiredtrigger;
+
+ Assert(pstate->continuescan);
+ Assert(!so->needPrimScan);
+
+ res = _bt_check_compare(pstate->dir, so, tuple, natts, tupdesc,
+ &pstate->continuescan, &skrequiredtrigger,
+ requiredMatchedByPrecheck);
+
+ /*
+ * Only one _bt_check_compare call is required in the common case where
+ * there are no equality-type array scan keys.
+ *
+ * When there are array scan keys then we can still accept the first
+ * answer we get from _bt_check_compare when continuescan wasn't unset.
+ */
+ if (!so->numArrayKeys || pstate->continuescan)
+ return res;
+
+ /*
+ * _bt_check_compare set continuescan=false in the presence of equality
+ * type array keys. It's possible that we haven't reached the start of
+ * the array keys just yet. It's also possible that we need to advance
+ * the array keys now. (Or perhaps we really do need to terminate the
+ * top-level scan.)
+ */
+ pstate->continuescan = true; /* new initial assumption */
+
+ if (skrequiredtrigger && _bt_tuple_before_array_skeys(scan, pstate, tuple))
+ {
+ /*
+ * Tuple is still < the current array scan key values (as well as
+ * other equality type scan keys) if this is a forward scan.
+ * (Backwards scans reach here with a tuple > equality constraints.)
+ * We must now consider how to proceed with the ongoing primitive
+ * index scan.
+ *
+ * Should _bt_readpage continue with this page for now, in the hope of
+ * finding tuples whose key space is covered by the current array keys
+ * before too long? Or, should it give up and start a new primitive
+ * index scan instead?
+ *
+ * Our policy is to terminate the primitive index scan at the end of
+ * the current page if the current (most recently advanced) array keys
+ * don't cover the final tuple from the page. This policy is fairly
+ * conservative overall. Note, however, that our policy effectively
+ * infers what the next sibling page is likely to look like based on
+ * details from the current page (in particular its final tuple).
+ *
+ * It's possible that we'll gamble and lose: a grouping of tuples
+ * covered by the current array keys could be aligned with the key
+ * space boundaries of the current leaf page, without any later array
+ * keys having key space that is covered by the next sibling page.
+ */
+ if (finaltup || (!pstate->finaltupchecked && pstate->finaltup &&
+ _bt_tuple_before_array_skeys(scan, pstate,
+ pstate->finaltup)))
+ {
+ /*
+ * This is the final tuple (the high key for forward scans, or the
+ * tuple at the first offset number for backward scans), but it is
+ * still before the current array keys. As such, we're unwilling
+ * to allow the current primitive index scan to continue to the
+ * next leaf page. Start a new primitive index scan that will
+ * reposition the top-level scan to the first leaf page whose key
+ * space is covered by our _current_ array keys. We expect that
+ * this process will effectively make the scan "skip over" a group
+ * of leaf pages that cannot possibly contain any matching tuples.
+ *
+ * Note: _bt_readpage stashes the final tuple, which allows us to
+ * make this check early. We thereby avoid comparing very many
+ * extra tuples on the page. This is just an optimization;
+ * skipping these useless comparisons should never change our
+ * final conclusion about what the scan should do next.
+ */
+ pstate->continuescan = false;
+ so->needPrimScan = true;
+ }
+ else if (!finaltup && pstate->finaltup)
+ {
+ /*
+ * Remember that the final tuple has been checked with this
+ * particular set of array keys.
+ *
+ * It might make sense to check the same tuple again at some point
+ * during the ongoing _bt_readpage-wise scan of this page. But it
+ * is definitely wasteful to repeat the same check before the
+ * array keys are advanced by some later non-final tuple.
+ */
+ pstate->finaltupchecked = true;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual
+ */
+ return false;
+ }
+
+ /*
+ * Caller's tuple is >= the current set of array keys and other equality
+ * constraint scan keys (or <= if this is a backwards scans).
+ *
+ * It it now time to advance the array keys based on the values from this
+ * tuple. Do that now, while determining in passing if the tuple matches
+ * the newly advanced set of array keys (if we've any left).
+ *
+ * This call will also set continuescan for us (or tells us to perform
+ * another _bt_check_compare call, which then sets continuescan for us).
+ */
+ if (!_bt_advance_array_keys(scan, pstate, tuple, skrequiredtrigger))
+ {
+ /*
+ * Tuple doesn't match any later array keys, either. Give up on this
+ * tuple being a match.
+ */
+ return false;
+ }
+
+ /*
+ * Advanced array keys to values that are exact matches for corresponding
+ * attribute values from the tuple. Check back with _bt_check_compare.
+ */
+ return _bt_check_compare(pstate->dir, so, tuple, natts, tupdesc,
+ &pstate->continuescan, &skrequiredtrigger,
+ false);
+}
+
+/*
+ * Test whether an indextuple satisfies current scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction to
+ * pass the qual with the current set of array keys.
+ *
+ * This is a subroutine for _bt_checkkeys. It is written with the assumption
+ * that reaching the end of each distinct set of array keys terminates the
+ * ongoing primitive index scan. It is up to our caller (that has more
+ * context than we have available here) to override that initial determination
+ * when it makes more sense to advance the array keys and continue with
+ * further tuples from the same leaf page.
+ */
+static bool
+_bt_check_compare(ScanDirection dir, BTScanOpaque so,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ bool *continuescan, bool *skrequiredtrigger,
+ bool requiredMatchedByPrecheck)
+{
int ikey;
ScanKey key;
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+ Assert(!so->numArrayKeys || !requiredMatchedByPrecheck);
*continuescan = true; /* default assumption */
+ *skrequiredtrigger = true; /* default assumption */
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ for (key = so->keyData, ikey = 0; ikey < so->numberOfKeys; key++, ikey++)
{
Datum datum;
bool isNull;
@@ -1526,7 +2844,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* _bt_first() except for the NULLs checking, which have already done
* above.
*/
- if (!requiredOppositeDir)
+ if (!requiredOppositeDir || so->numArrayKeys)
{
test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
datum, key->sk_argument);
@@ -1549,10 +2867,22 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* qual fails, it is critical that equality quals be used for the
* initial positioning in _bt_first() when they are available. See
* comments in _bt_first().
+ *
+ * Scans with equality-type array scan keys run into a similar
+ * problem whenever they advance the array keys. Our caller uses
+ * _bt_tuple_before_array_skeys to avoid the problem there.
*/
if (requiredSameDir)
*continuescan = false;
+ if ((key->sk_flags & SK_SEARCHARRAY) &&
+ key->sk_strategy == BTEqualStrategyNumber)
+ {
+ if (!requiredSameDir)
+ *skrequiredtrigger = false;
+ *continuescan = false;
+ }
+
/*
* In any case, this indextuple doesn't match the qual.
*/
@@ -1571,7 +2901,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* it's not possible for any future tuples in the current scan direction
* to pass the qual.
*
- * This is a subroutine for _bt_checkkeys, which see for more info.
+ * This is a subroutine for _bt_checkkeys/_bt_check_compare.
*/
static bool
_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 03a5fbdc6..e37597c26 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -106,8 +106,7 @@ static List *build_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexOptInfo *index, IndexClauseSet *clauses,
bool useful_predicate,
ScanTypeControl scantype,
- bool *skip_nonnative_saop,
- bool *skip_lower_saop);
+ bool *skip_nonnative_saop);
static List *build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel,
List *clauses, List *other_clauses);
static List *generate_bitmap_or_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -706,8 +705,6 @@ eclass_already_used(EquivalenceClass *parent_ec, Relids oldrelids,
* index AM supports them natively, we should just include them in simple
* index paths. If not, we should exclude them while building simple index
* paths, and then make a separate attempt to include them in bitmap paths.
- * Furthermore, we should consider excluding lower-order ScalarArrayOpExpr
- * quals so as to create ordered paths.
*/
static void
get_index_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -716,37 +713,17 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
{
List *indexpaths;
bool skip_nonnative_saop = false;
- bool skip_lower_saop = false;
ListCell *lc;
/*
* Build simple index paths using the clauses. Allow ScalarArrayOpExpr
- * clauses only if the index AM supports them natively, and skip any such
- * clauses for index columns after the first (so that we produce ordered
- * paths if possible).
+ * clauses only if the index AM supports them natively.
*/
indexpaths = build_index_paths(root, rel,
index, clauses,
index->predOK,
ST_ANYSCAN,
- &skip_nonnative_saop,
- &skip_lower_saop);
-
- /*
- * If we skipped any lower-order ScalarArrayOpExprs on an index with an AM
- * that supports them, then try again including those clauses. This will
- * produce paths with more selectivity but no ordering.
- */
- if (skip_lower_saop)
- {
- indexpaths = list_concat(indexpaths,
- build_index_paths(root, rel,
- index, clauses,
- index->predOK,
- ST_ANYSCAN,
- &skip_nonnative_saop,
- NULL));
- }
+ &skip_nonnative_saop);
/*
* Submit all the ones that can form plain IndexScan plans to add_path. (A
@@ -784,7 +761,6 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
index, clauses,
false,
ST_BITMAPSCAN,
- NULL,
NULL);
*bitindexpaths = list_concat(*bitindexpaths, indexpaths);
}
@@ -817,27 +793,19 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
* to true if we found any such clauses (caller must initialize the variable
* to false). If it's NULL, we do not ignore ScalarArrayOpExpr clauses.
*
- * If skip_lower_saop is non-NULL, we ignore ScalarArrayOpExpr clauses for
- * non-first index columns, and we set *skip_lower_saop to true if we found
- * any such clauses (caller must initialize the variable to false). If it's
- * NULL, we do not ignore non-first ScalarArrayOpExpr clauses, but they will
- * result in considering the scan's output to be unordered.
- *
* 'rel' is the index's heap relation
* 'index' is the index for which we want to generate paths
* 'clauses' is the collection of indexable clauses (IndexClause nodes)
* 'useful_predicate' indicates whether the index has a useful predicate
* 'scantype' indicates whether we need plain or bitmap scan support
* 'skip_nonnative_saop' indicates whether to accept SAOP if index AM doesn't
- * 'skip_lower_saop' indicates whether to accept non-first-column SAOP
*/
static List *
build_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexOptInfo *index, IndexClauseSet *clauses,
bool useful_predicate,
ScanTypeControl scantype,
- bool *skip_nonnative_saop,
- bool *skip_lower_saop)
+ bool *skip_nonnative_saop)
{
List *result = NIL;
IndexPath *ipath;
@@ -848,7 +816,6 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
List *orderbyclausecols;
List *index_pathkeys;
List *useful_pathkeys;
- bool found_lower_saop_clause;
bool pathkeys_possibly_useful;
bool index_is_ordered;
bool index_only_scan;
@@ -880,19 +847,11 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
* on by btree and possibly other places.) The list can be empty, if the
* index AM allows that.
*
- * found_lower_saop_clause is set true if we accept a ScalarArrayOpExpr
- * index clause for a non-first index column. This prevents us from
- * assuming that the scan result is ordered. (Actually, the result is
- * still ordered if there are equality constraints for all earlier
- * columns, but it seems too expensive and non-modular for this code to be
- * aware of that refinement.)
- *
* We also build a Relids set showing which outer rels are required by the
* selected clauses. Any lateral_relids are included in that, but not
* otherwise accounted for.
*/
index_clauses = NIL;
- found_lower_saop_clause = false;
outer_relids = bms_copy(rel->lateral_relids);
for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++)
{
@@ -903,30 +862,20 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexClause *iclause = (IndexClause *) lfirst(lc);
RestrictInfo *rinfo = iclause->rinfo;
- /* We might need to omit ScalarArrayOpExpr clauses */
- if (IsA(rinfo->clause, ScalarArrayOpExpr))
+ /*
+ * We might need to omit ScalarArrayOpExpr clauses when index AM
+ * lacks native support
+ */
+ if (!index->amsearcharray && IsA(rinfo->clause, ScalarArrayOpExpr))
{
- if (!index->amsearcharray)
+ if (skip_nonnative_saop)
{
- if (skip_nonnative_saop)
- {
- /* Ignore because not supported by index */
- *skip_nonnative_saop = true;
- continue;
- }
- /* Caller had better intend this only for bitmap scan */
- Assert(scantype == ST_BITMAPSCAN);
- }
- if (indexcol > 0)
- {
- if (skip_lower_saop)
- {
- /* Caller doesn't want to lose index ordering */
- *skip_lower_saop = true;
- continue;
- }
- found_lower_saop_clause = true;
+ /* Ignore because not supported by index */
+ *skip_nonnative_saop = true;
+ continue;
}
+ /* Caller had better intend this only for bitmap scan */
+ Assert(scantype == ST_BITMAPSCAN);
}
/* OK to include this clause */
@@ -956,11 +905,9 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
/*
* 2. Compute pathkeys describing index's ordering, if any, then see how
* many of them are actually useful for this query. This is not relevant
- * if we are only trying to build bitmap indexscans, nor if we have to
- * assume the scan is unordered.
+ * if we are only trying to build bitmap indexscans.
*/
pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN &&
- !found_lower_saop_clause &&
has_useful_pathkeys(root, rel));
index_is_ordered = (index->sortopfamily != NULL);
if (index_is_ordered && pathkeys_possibly_useful)
@@ -1212,7 +1159,6 @@ build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel,
index, &clauseset,
useful_predicate,
ST_BITMAPSCAN,
- NULL,
NULL);
result = list_concat(result, indexpaths);
}
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076..1b899b2db 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6444,8 +6444,6 @@ genericcostestimate(PlannerInfo *root,
double numIndexTuples;
double spc_random_page_cost;
double num_sa_scans;
- double num_outer_scans;
- double num_scans;
double qual_op_cost;
double qual_arg_cost;
List *selectivityQuals;
@@ -6460,7 +6458,7 @@ genericcostestimate(PlannerInfo *root,
/*
* Check for ScalarArrayOpExpr index quals, and estimate the number of
- * index scans that will be performed.
+ * primitive index scans that will be performed for caller
*/
num_sa_scans = 1;
foreach(l, indexQuals)
@@ -6490,19 +6488,8 @@ genericcostestimate(PlannerInfo *root,
*/
numIndexTuples = costs->numIndexTuples;
if (numIndexTuples <= 0.0)
- {
numIndexTuples = indexSelectivity * index->rel->tuples;
- /*
- * The above calculation counts all the tuples visited across all
- * scans induced by ScalarArrayOpExpr nodes. We want to consider the
- * average per-indexscan number, so adjust. This is a handy place to
- * round to integer, too. (If caller supplied tuple estimate, it's
- * responsible for handling these considerations.)
- */
- numIndexTuples = rint(numIndexTuples / num_sa_scans);
- }
-
/*
* We can bound the number of tuples by the index size in any case. Also,
* always estimate at least one tuple is touched, even when
@@ -6540,27 +6527,31 @@ genericcostestimate(PlannerInfo *root,
*
* The above calculations are all per-index-scan. However, if we are in a
* nestloop inner scan, we can expect the scan to be repeated (with
- * different search keys) for each row of the outer relation. Likewise,
- * ScalarArrayOpExpr quals result in multiple index scans. This creates
- * the potential for cache effects to reduce the number of disk page
- * fetches needed. We want to estimate the average per-scan I/O cost in
- * the presence of caching.
+ * different search keys) for each row of the outer relation. This
+ * creates the potential for cache effects to reduce the number of disk
+ * page fetches needed. We want to estimate the average per-scan I/O cost
+ * in the presence of caching.
*
* We use the Mackert-Lohman formula (see costsize.c for details) to
* estimate the total number of page fetches that occur. While this
* wasn't what it was designed for, it seems a reasonable model anyway.
* Note that we are counting pages not tuples anymore, so we take N = T =
* index size, as if there were one "tuple" per page.
+ *
+ * Note: we assume that there will be no repeat index page fetches across
+ * ScalarArrayOpExpr primitive scans from the same logical index scan.
+ * This is guaranteed to be true for btree indexes, but is very optimistic
+ * with index AMs that cannot natively execute ScalarArrayOpExpr quals.
+ * However, these same index AMs also accept our default pessimistic
+ * approach to counting num_sa_scans (btree caller caps this), so we don't
+ * expect the final indexTotalCost to be wildly over-optimistic.
*/
- num_outer_scans = loop_count;
- num_scans = num_sa_scans * num_outer_scans;
-
- if (num_scans > 1)
+ if (loop_count > 1)
{
double pages_fetched;
/* total page fetches ignoring cache effects */
- pages_fetched = numIndexPages * num_scans;
+ pages_fetched = numIndexPages * loop_count;
/* use Mackert and Lohman formula to adjust for cache effects */
pages_fetched = index_pages_fetched(pages_fetched,
@@ -6570,11 +6561,9 @@ genericcostestimate(PlannerInfo *root,
/*
* Now compute the total disk access cost, and then report a pro-rated
- * share for each outer scan. (Don't pro-rate for ScalarArrayOpExpr,
- * since that's internal to the indexscan.)
+ * share for each outer scan
*/
- indexTotalCost = (pages_fetched * spc_random_page_cost)
- / num_outer_scans;
+ indexTotalCost = (pages_fetched * spc_random_page_cost) / loop_count;
}
else
{
@@ -6590,10 +6579,8 @@ genericcostestimate(PlannerInfo *root,
* evaluated once at the start of the scan to reduce them to runtime keys
* to pass to the index AM (see nodeIndexscan.c). We model the per-tuple
* CPU costs as cpu_index_tuple_cost plus one cpu_operator_cost per
- * indexqual operator. Because we have numIndexTuples as a per-scan
- * number, we have to multiply by num_sa_scans to get the correct result
- * for ScalarArrayOpExpr cases. Similarly add in costs for any index
- * ORDER BY expressions.
+ * indexqual operator. Similarly add in costs for any index ORDER BY
+ * expressions.
*
* Note: this neglects the possible costs of rechecking lossy operators.
* Detecting that that might be needed seems more expensive than it's
@@ -6606,7 +6593,7 @@ genericcostestimate(PlannerInfo *root,
indexStartupCost = qual_arg_cost;
indexTotalCost += qual_arg_cost;
- indexTotalCost += numIndexTuples * num_sa_scans * (cpu_index_tuple_cost + qual_op_cost);
+ indexTotalCost += numIndexTuples * (cpu_index_tuple_cost + qual_op_cost);
/*
* Generic assumption about index correlation: there isn't any.
@@ -6684,7 +6671,6 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
bool eqQualHere;
bool found_saop;
bool found_is_null_op;
- double num_sa_scans;
ListCell *lc;
/*
@@ -6699,17 +6685,12 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
*
* For a RowCompareExpr, we consider only the first column, just as
* rowcomparesel() does.
- *
- * If there's a ScalarArrayOpExpr in the quals, we'll actually perform N
- * index scans not one, but the ScalarArrayOpExpr's operator can be
- * considered to act the same as it normally does.
*/
indexBoundQuals = NIL;
indexcol = 0;
eqQualHere = false;
found_saop = false;
found_is_null_op = false;
- num_sa_scans = 1;
foreach(lc, path->indexclauses)
{
IndexClause *iclause = lfirst_node(IndexClause, lc);
@@ -6749,14 +6730,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
else if (IsA(clause, ScalarArrayOpExpr))
{
ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause;
- Node *other_operand = (Node *) lsecond(saop->args);
- int alength = estimate_array_length(other_operand);
clause_op = saop->opno;
found_saop = true;
- /* count number of SA scans induced by indexBoundQuals only */
- if (alength > 1)
- num_sa_scans *= alength;
}
else if (IsA(clause, NullTest))
{
@@ -6816,13 +6792,6 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
JOIN_INNER,
NULL);
numIndexTuples = btreeSelectivity * index->rel->tuples;
-
- /*
- * As in genericcostestimate(), we have to adjust for any
- * ScalarArrayOpExpr quals included in indexBoundQuals, and then round
- * to integer.
- */
- numIndexTuples = rint(numIndexTuples / num_sa_scans);
}
/*
@@ -6832,6 +6801,48 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
genericcostestimate(root, path, loop_count, &costs);
+ /*
+ * Now compensate for btree's ability to efficiently execute scans with
+ * SAOP clauses.
+ *
+ * btree automatically combines individual ScalarArrayOpExpr primitive
+ * index scans whenever the tuples covered by the next set of array keys
+ * are close to tuples covered by the current set. This makes the final
+ * number of descents particularly difficult to estimate. However, btree
+ * scans never visit any single leaf page more than once. That puts a
+ * natural floor under the worst case number of descents.
+ *
+ * It's particularly important that we not wildly overestimate the number
+ * of descents needed for a clause list with several SAOPs -- the costs
+ * really aren't multiplicative in the way genericcostestimate expects. In
+ * general, most distinct combinations of SAOP keys will tend to not find
+ * any matching tuples. Furthermore, btree scans search for the next set
+ * of array keys using the next tuple in line, and so won't even need a
+ * direct comparison to eliminate most non-matching sets of array keys.
+ *
+ * Clamp the number of descents to the estimated number of leaf page
+ * visits. This is still fairly pessimistic, but tends to result in more
+ * accurate costing of scans with several SAOP clauses -- especially when
+ * each array has more than a few elements. The cost of adding additional
+ * array constants to a low-order SAOP column should saturate past a
+ * certain point (except where selectivity estimates continue to shift).
+ *
+ * Also clamp the number of descents to 1/3 the number of index pages.
+ * This avoids implausibly high estimates with low selectivity paths,
+ * where scans frequently require no more than one or two descents.
+ *
+ * XXX Ideally, we'd also account for the fact that non-boundary SAOP
+ * clause quals (which the B-Tree code uses "non-required" scan keys for)
+ * won't actually contribute to the total number of descents of the index.
+ * This would require pushing down more context into genericcostestimate.
+ */
+ if (costs.num_sa_scans > 1)
+ {
+ costs.num_sa_scans = Min(costs.num_sa_scans, costs.numIndexPages);
+ costs.num_sa_scans = Min(costs.num_sa_scans, index->pages / 3);
+ costs.num_sa_scans = Max(costs.num_sa_scans, 1);
+ }
+
/*
* Add a CPU-cost component to represent the costs of initial btree
* descent. We don't charge any I/O cost for touching upper btree levels,
@@ -6839,9 +6850,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* comparisons to descend a btree of N leaf tuples. We charge one
* cpu_operator_cost per comparison.
*
- * If there are ScalarArrayOpExprs, charge this once per SA scan. The
- * ones after the first one are not startup cost so far as the overall
- * plan is concerned, so add them only to "total" cost.
+ * If there are ScalarArrayOpExprs, charge this once per estimated
+ * primitive SA scan. The ones after the first one are not startup cost
+ * so far as the overall plan goes, so just add them to "total" cost.
*/
if (index->tuples > 1) /* avoid computing log(0) */
{
@@ -6858,7 +6869,8 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* in cases where only a single leaf page is expected to be visited. This
* cost is somewhat arbitrarily set at 50x cpu_operator_cost per page
* touched. The number of such pages is btree tree height plus one (ie,
- * we charge for the leaf page too). As above, charge once per SA scan.
+ * we charge for the leaf page too). As above, charge once per estimated
+ * primitive SA scan.
*/
descentCost = (index->tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER * cpu_operator_cost;
costs.indexStartupCost += descentCost;
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e068f7e24..da90412d5 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4035,6 +4035,19 @@ description | Waiting for a newly initialized WAL file to reach durable storage
</para>
</note>
+ <note>
+ <para>
+ Every time an index is searched, the index's
+ <structname>pg_stat_all_indexes</structname>.<structfield>idx_scan</structfield>
+ field is incremented. This usually happens once per index scan node
+ execution, but might take place several times during execution of a scan
+ that searches for multiple values together. Only queries that use certain
+ <acronym>SQL</acronym> constructs to search for rows matching any value
+ out of a list (or an array) of multiple scalar values are affected. See
+ <xref linkend="functions-comparisons"/> for details.
+ </para>
+ </note>
+
</sect2>
<sect2 id="monitoring-pg-statio-all-tables-view">
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index acfd9d1f4..84c068ae3 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1910,7 +1910,7 @@ SELECT count(*) FROM dupindexcols
(1 row)
--
--- Check ordering of =ANY indexqual results (bug in 9.2.0)
+-- Check that index scans with =ANY indexquals return rows in index order
--
explain (costs off)
SELECT unique1 FROM tenk1
@@ -1936,12 +1936,11 @@ explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
- QUERY PLAN
--------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------------------
Index Only Scan using tenk1_thous_tenthous on tenk1
- Index Cond: (thousand < 2)
- Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
-(3 rows)
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -1952,18 +1951,35 @@ ORDER BY thousand;
1 | 1001
(2 rows)
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Only Scan Backward using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ thousand | tenthous
+----------+----------
+ 1 | 1001
+ 0 | 3000
+(2 rows)
+
SET enable_indexonlyscan = OFF;
explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
- QUERY PLAN
---------------------------------------------------------------------------------------
- Sort
- Sort Key: thousand
- -> Index Scan using tenk1_thous_tenthous on tenk1
- Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
-(4 rows)
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -1974,6 +1990,25 @@ ORDER BY thousand;
1 | 1001
(2 rows)
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan Backward using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ thousand | tenthous
+----------+----------
+ 1 | 1001
+ 0 | 3000
+(2 rows)
+
RESET enable_indexonlyscan;
--
-- Check elimination of constant-NULL subexpressions
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 892ea5f17..f4939cd74 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -8620,10 +8620,9 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 >= any (array[1,5]);
Merge Cond: (j1.id1 = j2.id1)
Join Filter: (j2.id2 = j1.id2)
-> Index Scan using j1_id1_idx on j1
- -> Index Only Scan using j2_pkey on j2
+ -> Index Scan using j2_id1_idx on j2
Index Cond: (id1 >= ANY ('{1,5}'::integer[]))
- Filter: ((id1 % 1000) = 1)
-(7 rows)
+(6 rows)
select * from j1
inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index d49ce9f30..41b955a27 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -753,7 +753,7 @@ SELECT count(*) FROM dupindexcols
WHERE f1 BETWEEN 'WA' AND 'ZZZ' and id < 1000 and f1 ~<~ 'YX';
--
--- Check ordering of =ANY indexqual results (bug in 9.2.0)
+-- Check that index scans with =ANY indexquals return rows in index order
--
explain (costs off)
@@ -774,6 +774,15 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
SET enable_indexonlyscan = OFF;
explain (costs off)
@@ -785,6 +794,15 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
RESET enable_indexonlyscan;
--
--
2.42.0
On Tue, Nov 7, 2023 at 5:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
If you end up finding a bug in this v6, it'll most likely be a case
where nbtree fails to live up to that. This project is as much about
robust/predictable performance as anything else -- nbtree needs to be
able to cope with practically anything. I suggest that your review
start by trying to break the patch along these lines.
I spent some time on this myself today (which I'd already planned on).
Attached is an adversarial stress-test, which shows something that
must be approaching the worst case for the patch in terms of time
spent with a buffer lock held, due to spending so much time evaluating
unusually expensive SAOP index quals. The array binary searches that
take place with a buffer lock held aren't quite like anything else
that nbtree can do right now, so it's worthy of special attention.
I thought of several factors that maximize both the number of binary
searches within any given _bt_readpage, as well as the cost of each
binary search -- the SQL file has full details. My test query is
*extremely* unrealistic, since it combines multiple independent
unrealistic factors, all of which aim to make life hard for the
implementation in one way or another. I hesitate to say that it
couldn't be much worse (I only spent a few hours on this), but I'm
prepared to say that it seems very unlikely that any real world query
could make the patch spend as many cycles in
_bt_readpage/_bt_checkkeys as this one does.
Perhaps you can think of some other factor that would make this test
case even less sympathetic towards the patch, Matthias? The only thing
I thought of that I've left out was the use of a custom btree opclass,
"unrealistically_slow_ops". Something that calls pg_usleep in its
order proc. (I left it out because it wouldn't prove anything.)
On my machine, custom instrumentation shows that each call to
_bt_readpage made while this query executes (on a patched server)
takes just under 1.4 milliseconds. While that is far longer than it
usually takes, it's basically acceptable IMV. It's not significantly
longer than I'd expect heap_index_delete_tuples() to take on an
average day with EBS (or other network-attached storage). But that's a
process that happens all the time, with an exclusive buffer lock held
on the leaf page throughout -- whereas this is only a shared buffer
lock, and involves a query that's just absurd .
Another factor that makes this seem acceptable is just how sensitive
the test case is to everything going exactly and perfectly wrong, all
at the same time, again and again. The test case uses a 32 column
index (the INDEX_MAX_KEYS maximum), with a query that has 32 SAOP
clauses (one per index column). If I reduce the number of SAOP clauses
in the query to (say) 8, I still have a test case that's almost as
silly as my original -- but now we only spend ~225 microseconds in
each _bt_readpage call (i.e. we spend over 6x less time in each
_bt_readpage call). (Admittedly if I also make the CREATE INDEX use
only 8 columns, we can fit more index tuples on one page, leaving us
at ~800 microseconds).
I'm a little surprised that it isn't a lot worse than this, given how
far I went. I was a little concerned that it would prove necessary to
lock this kind of thing down at some higher level (e.g., in the
planner), but that now looks unnecessary. There are much better ways
to DOS the server than this. For example, you could run this same
query while forcing a sequential scan! That appears to be quite a lot
less responsive to interrupts (in addition to being hopelessly slow),
probably because it uses parallel workers, each of which will use
wildly expensive filter quals that just do a linear scan of the SAOP.
--
Peter Geoghegan
Attachments:
On Fri, 10 Nov 2023 at 00:58, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Nov 7, 2023 at 5:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
If you end up finding a bug in this v6, it'll most likely be a case
where nbtree fails to live up to that. This project is as much about
robust/predictable performance as anything else -- nbtree needs to be
able to cope with practically anything. I suggest that your review
start by trying to break the patch along these lines.I spent some time on this myself today (which I'd already planned on).
Attached is an adversarial stress-test, which shows something that
must be approaching the worst case for the patch in terms of time
spent with a buffer lock held, due to spending so much time evaluating
unusually expensive SAOP index quals. The array binary searches that
take place with a buffer lock held aren't quite like anything else
that nbtree can do right now, so it's worthy of special attention.I thought of several factors that maximize both the number of binary
searches within any given _bt_readpage, as well as the cost of each
binary search -- the SQL file has full details. My test query is
*extremely* unrealistic, since it combines multiple independent
unrealistic factors, all of which aim to make life hard for the
implementation in one way or another. I hesitate to say that it
couldn't be much worse (I only spent a few hours on this), but I'm
prepared to say that it seems very unlikely that any real world query
could make the patch spend as many cycles in
_bt_readpage/_bt_checkkeys as this one does.Perhaps you can think of some other factor that would make this test
case even less sympathetic towards the patch, Matthias? The only thing
I thought of that I've left out was the use of a custom btree opclass,
"unrealistically_slow_ops". Something that calls pg_usleep in its
order proc. (I left it out because it wouldn't prove anything.)
Have you tried using text index columns that are sorted with
non-default locales?
I've seen non-default locales use significantly more resources during
compare operations than any other ordering operation I know of (which
has mostly been in finding the locale), and use it extensively to test
improvements for worst index shapes over at my btree patchsets because
locales are dynamically loaded in text compare and nondefault locales
are not cached at all. I suspect that this would be even worse if a
somehow even worse locale path is available than what I'm using for
test right now; this could be the case with complex custom ICU
locales.
On my machine, custom instrumentation shows that each call to
_bt_readpage made while this query executes (on a patched server)
takes just under 1.4 milliseconds. While that is far longer than it
usually takes, it's basically acceptable IMV. It's not significantly
longer than I'd expect heap_index_delete_tuples() to take on an
average day with EBS (or other network-attached storage). But that's a
process that happens all the time, with an exclusive buffer lock held
on the leaf page throughout -- whereas this is only a shared buffer
lock, and involves a query that's just absurd .Another factor that makes this seem acceptable is just how sensitive
the test case is to everything going exactly and perfectly wrong, all
at the same time, again and again. The test case uses a 32 column
index (the INDEX_MAX_KEYS maximum), with a query that has 32 SAOP
clauses (one per index column). If I reduce the number of SAOP clauses
in the query to (say) 8, I still have a test case that's almost as
silly as my original -- but now we only spend ~225 microseconds in
each _bt_readpage call (i.e. we spend over 6x less time in each
_bt_readpage call). (Admittedly if I also make the CREATE INDEX use
only 8 columns, we can fit more index tuples on one page, leaving us
at ~800 microseconds).
A quick update of the table definition to use the various installed
'fr-%-x-icu' locales on text hash columns instead of numeric with a
different collation for each column this gets me to EXPLAIN (analyze)
showing 2.07ms spent every buffer hit inside the index scan node, as
opposed to 1.76ms when using numeric. But, as you mention, the value
of this metric is probably not very high.
As for the patch itself, I'm probably about 50% through the patch now.
While reviewing, I noticed the following two user-visible items,
related to SAOP but not broken by or touched upon in this patch:
1. We don't seem to plan `column opr ALL (...)` as index conditions,
while this should be trivial to optimize for at least btree. Example:
SET enable_bitmapscan = OFF;
WITH a AS (select generate_series(1, 1000) a)
SELECT * FROM tenk1
WHERE thousand = ANY (array(table a))
AND thousand < ALL (array(table a));
This will never return any rows, but it does hit 9990 buffers in the
new btree code, while I expected that to be 0 buffers based on the
query and index (that is, I expected to hit 0 buffers, until I
realized that we don't push ALL into index filters). I shall assume
ALL isn't used all that often (heh), but it sure feels like we're
missing out on performance here.
2. We also don't seem to support array keys for row compares, which
probably is an even more niche use case:
SELECT count(*)
FROM tenk1
WHERE (thousand, tenthous) = ANY (ARRAY[(1, 1), (1, 2), (2, 1)]);
This is no different from master, too, but it'd be nice if there was
support for arrays of row operations, too, just so that composite
primary keys can also be looked up with SAOPs.
Kind regards,
Matthias van de Meent
On Wed, 8 Nov 2023 at 02:53, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Nov 7, 2023 at 4:20 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan <pg@bowt.ie> wrote:
I should be able to post v6 later this week. My current plan is to
commit the other nbtree patch first (the backwards scan "boundary
cases" one from the ongoing CF) -- since I saw your review earlier
today. I think that you should probably wait for this v6 before
starting your review.Okay, thanks for the update, then I'll wait for v6 to be posted.
On second thought, I'll just post v6 now (there won't be conflicts
against the master branch once the other patch is committed anyway).
Thanks. Here's my review of the btree-related code:
+++ b/src/backend/access/nbtree/nbtsearch.c @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) * set flag to true if all required keys are satisfied and false * otherwise. */ - (void) _bt_checkkeys(scan, itup, indnatts, dir, - &requiredMatchedByPrecheck, false); + _bt_checkkeys(scan, &pstate, itup, false, false); + requiredMatchedByPrecheck = pstate.continuescan; + pstate.continuescan = true; /* reset */
The comment above the updated section needs to be updated.
@@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) * set flag to true if all required keys are satisfied and false * otherwise. */ - (void) _bt_checkkeys(scan, itup, indnatts, dir, - &requiredMatchedByPrecheck, false); + _bt_checkkeys(scan, &pstate, itup, false, false);
This 'false' finaltup argument is surely wrong for the rightmost
page's rightmost tuple, no?
+++ b/src/backend/access/nbtree/nbtutils.c @@ -357,6 +431,46 @@ _bt_preprocess_array_keys(IndexScanDesc scan) + /* We could pfree(elem_values) after, but not worth the cycles */ + num_elems = _bt_merge_arrays(scan, cur, + (indoption[cur->sk_attno - 1] & INDOPTION_DESC) != 0, + prev->elem_values, prev->num_elems, + elem_values, num_elems);
This code can get hit several times when there are multiple = ANY
clauses, which may result in repeated leakage of these arrays during
this scan. I think cleaning up may well be worth the cycles when the
total size of the arrays is large enough.
@@ -496,6 +627,48 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey, _bt_compare_array_elements, &cxt); +_bt_merge_arrays(IndexScanDesc scan, ScanKey skey, bool reverse, + Datum *elems_orig, int nelems_orig, + Datum *elems_next, int nelems_next) [...] + /* + * Incrementally copy the original array into a temp buffer, skipping over + * any items that are missing from the "next" array + */
Given that we only keep the members that both arrays have in common,
the result array will be a strict subset of the original array. So, I
don't quite see why we need the temporary buffer here - we can reuse
the entries of the elems_orig array that we've already compared
against the elems_next array.
We may want to optimize this further by iterating over only the
smallest array: With the current code, [1, 2] + [1....1000] is faster
to merge than [1..1000] + [1000, 1001], because 2 * log(1000) is much
smaller than 1000*log(2). In practice this may matter very little,
though.
An even better optimized version would do a merge join on the two
arrays, rather than loop + binary search.
@@ -515,6 +688,161 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg) [...] +_bt_binsrch_array_skey(FmgrInfo *orderproc,
Is there a reason for this complex initialization of high/low_elem,
rather than the this easier to understand and more compact
initialization?:
+ low_elem = 0;
+ high_elem = array->num_elems - 1;
+ if (cur_elem_start)
+ {
+ if (ScanDirectionIsForward(dir))
+ low_elem = array->cur_elem;
+ else
+ high_elem = array->cur_elem;
+ }
@@ -661,20 +1008,691 @@ _bt_restore_array_keys(IndexScanDesc scan) [...] + _bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir) [...] + if (scan->parallel_scan != NULL) + _bt_parallel_done(scan); + + /* + * No more primitive index scans. Terminate the top-level scan. + */ + return false;
I think the conditional _bt_parallel_done(scan) feels misplaced here,
as the comment immediately below indicates the scan is to be
terminated after that comment. So, either move this _bt_parallel_done
call outside the function (which by name would imply it is read-only,
without side effects like this) or move it below the comment
"terminate the top-level scan".
+_bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate, [...] + * Set up ORDER 3-way comparison function and array state [...] + * Optimization: Skip over non-required scan keys when we know that
These two sections should probably be swapped, as the skip makes the
setup useless.
Also, the comment here is wrong; the scan keys that are skipped are
'required', not 'non-required'.
+++ b/src/test/regress/expected/join.out @@ -8620,10 +8620,9 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 >= any (array[1,5]); Merge Cond: (j1.id1 = j2.id1) Join Filter: (j2.id2 = j1.id2) -> Index Scan using j1_id1_idx on j1 - -> Index Only Scan using j2_pkey on j2 + -> Index Scan using j2_id1_idx on j2 Index Cond: (id1 >= ANY ('{1,5}'::integer[])) - Filter: ((id1 % 1000) = 1) -(7 rows) +(6 rows)
I'm a bit surprised that we don't have the `id1 % 1000 = 1` filter
anymore. The output otherwise matches (quite possibly because the
other join conditions don't match) and I don't have time to
investigate the intricacies between IOS vs normal IS, but this feels
off.
----
As for the planner changes, I don't think I'm familiar enough with the
planner to make any authorative comments on this. However, it does
look like you've changed the meaning of 'amsearcharray', and I'm not
sure it's OK to assume all indexes that support amsearcharray will
also support for this new assumption of ordered retrieval of SAOPs.
For one, the pgroonga extension [0]https://github.com/pgroonga/pgroonga/blob/115414723c7eb8ce9eb667da98e008bd10fbae0a/src/pgroonga.c#L8782-L8788 does mark
amcanorder+amsearcharray.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Sat, Nov 11, 2023 at 1:08 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Thanks. Here's my review of the btree-related code:
Attached is v7.
The main focus in v7 is making the handling of required
inequality-strategy scan keys more explicit -- now there is an
understanding of inequalities shared by _bt_check_compare (the
function that becomes the guts of _bt_checkkeys) and the new
_bt_advance_array_keys function/state machine. The big idea for v7 is
to generalize how we handle required equality-strategy scan keys
(always required in both scan directions), extending the same concept
to deal with required inequality strategy scan keys (only ever
required in one direction, which may or may not be the scan
direction).
This led to my discovering and fixing a couple of bugs related to
inequality handling. These issues were of the same general character
as many others I've dealt with before now: they involved subtle
confusion about when and how to start another primitive index scan,
leading to the scan reading many more pages than strictly necessary
(potentially many more than master). In other words, cases where we
didn't give up and start another primitive index scan, even though
(with a repro of the issue) it's obviously not sensible. An accidental
full index scan.
While I'm still not completely happy with the way that inequalities
are handled, things in this area are much improved in v7.
It should be noted that the patch isn't strictly guaranteed to always
read fewer index pages than master, for a given query plan and index.
This is by design. Though the patch comes close, it's not quite a
certainty. There are known cases where the patch reads the occasional
extra page (relative to what master would have done under the same
circumstances). These are cases where the implementation just cannot
know for sure whether the next/sibling leaf page has key space covered
by any of the scan's array keys (at least not in a way that seems
practical). The implementation has simple heuristics that infer (a
polite word for "make an educated guess") about what will be found on
the next page. Theoretically we could be more conservative in how we
go about this, but that seems like a bad idea to me. It's really easy
to find cases where the maximally conservative approach loses by a
lot, and really hard to show cases where it wins at all.
These heuristics are more or less a limited form of the heuristics
that skip scan would need. A *very* limited form. We're still
conservative. Here's how it works, at a high level: if the scan can
make it all the way to the end of the page without having to start a
new primitive index scan (before reaching the end), and then finds
that "finaltup" itself (which is usually the page high key) advances
the array keys, we speculate: we move on to the sibling page. It's
just about possible that we'll discover (once on the next page) that
finaltup actually advanced the array keys by so much (in one single
advancement step) that the current/new keys cover key space beyond the
sibling page we just arrived at. The sibling page access will have
been wasted (though I prefer to think of it as a cost of doing
business).
I go into a lot of detail on the trade-offs in this area in comments
at the end of the new _bt_checkkeys(), just after it calls
_bt_advance_array_keys(). Hopefully this is reasonably clear. It's
always much easier to understand these things when you've written lots
of test cases, though. So I wouldn't at all be surprised to hear that
my explanation needs more work. I suspect that I'm spending more time
on the topic than it actually warrants, but you have to spend a lot of
time on it for yourself to be able to see why that is.
+++ b/src/backend/access/nbtree/nbtsearch.c @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) * set flag to true if all required keys are satisfied and false * otherwise. */ - (void) _bt_checkkeys(scan, itup, indnatts, dir, - &requiredMatchedByPrecheck, false); + _bt_checkkeys(scan, &pstate, itup, false, false); + requiredMatchedByPrecheck = pstate.continuescan; + pstate.continuescan = true; /* reset */The comment above the updated section needs to be updated.
Updated.
@@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) * set flag to true if all required keys are satisfied and false * otherwise. */ - (void) _bt_checkkeys(scan, itup, indnatts, dir, - &requiredMatchedByPrecheck, false); + _bt_checkkeys(scan, &pstate, itup, false, false);This 'false' finaltup argument is surely wrong for the rightmost
page's rightmost tuple, no?
Not in any practical sense. Since finaltup means "the tuple that you
should use to decide whether to go to the next page or not", and a
rightmost page doesn't have a next page.
There are exactly two ways that the top-level scan can end (not to be
confused with the primitive scan), at least in v7. They are:
1. The state machine can exhaust the scan's array keys, ending the
top-level scan.
2. The scan can just run out of pages, without ever running out of
array keys (some array keys can sort higher than any real value from
the index). This is just like how an index scan ends when it lacks any
required scan keys to terminate the scan, and eventually runs out of
pages to scan (think of an index-only scan that performs a full scan
of the index, feeding into a group aggregate).
Note that it wouldn't be okay if the design relied on _bt_checkkeys
advancing and exhausting the array keys -- we really do need both 1
and 2 to deal with various edge cases. For example, there is no way
that we'll ever be able to call _bt_checkkeys with a completely empty
index. It simply doesn't have any tuples at all. In fact, it doesn't
even have any pages (apart from the metapage), so clearly we can't
expect any calls to _bt_readpage (much less _bt_checkkeys).
+++ b/src/backend/access/nbtree/nbtutils.c @@ -357,6 +431,46 @@ _bt_preprocess_array_keys(IndexScanDesc scan) + /* We could pfree(elem_values) after, but not worth the cycles */ + num_elems = _bt_merge_arrays(scan, cur, + (indoption[cur->sk_attno - 1] & INDOPTION_DESC) != 0, + prev->elem_values, prev->num_elems, + elem_values, num_elems);This code can get hit several times when there are multiple = ANY
clauses, which may result in repeated leakage of these arrays during
this scan. I think cleaning up may well be worth the cycles when the
total size of the arrays is large enough.
They won't leak because the memory is allocated in the same dedicated
memory context.
That said, I added a pfree(). It couldn't hurt.
@@ -496,6 +627,48 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey, _bt_compare_array_elements, &cxt); +_bt_merge_arrays(IndexScanDesc scan, ScanKey skey, bool reverse, + Datum *elems_orig, int nelems_orig, + Datum *elems_next, int nelems_next) [...] + /* + * Incrementally copy the original array into a temp buffer, skipping over + * any items that are missing from the "next" array + */Given that we only keep the members that both arrays have in common,
the result array will be a strict subset of the original array. So, I
don't quite see why we need the temporary buffer here - we can reuse
the entries of the elems_orig array that we've already compared
against the elems_next array.
This code path is only hit when the query was written on autopilot,
since it must have contained redundant SAOPs for the same index column
-- a glaring inconsistency. Plus these arrays just aren't very big in
practice (despite my concerns about huge arrays). Plus there is only
one of these array-specific preprocessing steps per btrescan. So I
don't think that it's worth going to too much trouble here.
We may want to optimize this further by iterating over only the
smallest array: With the current code, [1, 2] + [1....1000] is faster
to merge than [1..1000] + [1000, 1001], because 2 * log(1000) is much
smaller than 1000*log(2). In practice this may matter very little,
though.
An even better optimized version would do a merge join on the two
arrays, rather than loop + binary search.
v7 allocates the temp buffer using the size of whatever array is the
smaller of the two, just because it's an easy marginal improvement.
@@ -515,6 +688,161 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg) [...] +_bt_binsrch_array_skey(FmgrInfo *orderproc,Is there a reason for this complex initialization of high/low_elem,
rather than the this easier to understand and more compact
initialization?:+ low_elem = 0; + high_elem = array->num_elems - 1; + if (cur_elem_start) + { + if (ScanDirectionIsForward(dir)) + low_elem = array->cur_elem; + else + high_elem = array->cur_elem; + }
I agree that it's better your way. Done that way in v7.
I think the conditional _bt_parallel_done(scan) feels misplaced here,
as the comment immediately below indicates the scan is to be
terminated after that comment. So, either move this _bt_parallel_done
call outside the function (which by name would imply it is read-only,
without side effects like this) or move it below the comment
"terminate the top-level scan".
v7 moves the comment up, so that it's just before the _bt_parallel_done() call.
+_bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate, [...] + * Set up ORDER 3-way comparison function and array state [...] + * Optimization: Skip over non-required scan keys when we know thatThese two sections should probably be swapped, as the skip makes the
setup useless.
Not quite: we need to increment arrayidx for later loop iterations/scan keys.
Also, the comment here is wrong; the scan keys that are skipped are
'required', not 'non-required'.
Agreed. Fixed.
+++ b/src/test/regress/expected/join.out @@ -8620,10 +8620,9 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 >= any (array[1,5]); Merge Cond: (j1.id1 = j2.id1) Join Filter: (j2.id2 = j1.id2) -> Index Scan using j1_id1_idx on j1 - -> Index Only Scan using j2_pkey on j2 + -> Index Scan using j2_id1_idx on j2 Index Cond: (id1 >= ANY ('{1,5}'::integer[])) - Filter: ((id1 % 1000) = 1) -(7 rows) +(6 rows)I'm a bit surprised that we don't have the `id1 % 1000 = 1` filter
anymore. The output otherwise matches (quite possibly because the
other join conditions don't match) and I don't have time to
investigate the intricacies between IOS vs normal IS, but this feels
off.
This happens because the new plan uses a completely different index --
which happens to be a partial index whose predicate exactly matches
the old plan's filter quals. That factor makes the filter quals
unnecessary. That's all this is.
As for the planner changes, I don't think I'm familiar enough with the
planner to make any authorative comments on this. However, it does
look like you've changed the meaning of 'amsearcharray', and I'm not
sure it's OK to assume all indexes that support amsearcharray will
also support for this new assumption of ordered retrieval of SAOPs.
For one, the pgroonga extension [0] does mark
amcanorder+amsearcharray.
The changes that I've made to the planner are subtractive. We more or
less go back to how things were just after the initial work on nbtree
amsearcharray support. That work was (at least tacitly) assumed to
have no impact on ordered scans. Because why should it? What other
type of index clause has ever affected what seems like a rather
unrelated thing (namely the sort order of the scan)? The oversight was
understandable. The kinds of plans that master cannot produce output
for in standard index order are really silly plans, independent of
this issue; it makes zero sense to allow a non-required array scan key
to affect how or when we skip.
The code that I'm removing from the planner is code that quite
obviously assumes nbtree-like behavior. So I'm taking away code like
that, rather than adding new code like that. That said, I am really
surprised that any extension creates an index AM amcanorder=true (not
to be confused with amcanorderbyop=true, which is less surprising).
That means that it promises the planner that it behaves just like
nbtree. To quote the docs, it must have "btree-compatible strategy
numbers for their [its] equality and ordering operators". Is that
really something that pgroonga even attempts? And if so, why?
I also find it bizarre that pgroonga's handler-stated capabilities
include "amcanunique=true". So pgroonga is a full text search engine,
but also supports unique indexes? I find that particularly hard to
believe, and suspect that the way that they set things up in the AM
handler just isn't very well thought out.
--
Peter Geoghegan
Attachments:
v7-0001-Enhance-nbtree-ScalarArrayOp-execution.patchapplication/octet-stream; name=v7-0001-Enhance-nbtree-ScalarArrayOp-execution.patchDownload
From 8e3db71c09aa1ecad1a90a9ec8b4cdbd38c37097 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 17 Jun 2023 17:03:36 -0700
Subject: [PATCH v7] Enhance nbtree ScalarArrayOp execution.
Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively. This works by pushing down the full context (the array keys)
to the nbtree index AM, enabling it to execute multiple primitive index
scans that the planner treats as one continuous index scan/index path.
This earlier enhancement enabled nbtree ScalarArrayOp index-only scans.
It also allowed scans with ScalarArrayOp quals to return ordered results
(with some notable restrictions, described further down).
Take this general approach a lot further: teach nbtree SAOP index scans
to determine how best to execute ScalarArrayOp scans (how many primitive
index scans to use under the hood) by applying information about the
physical characteristics of the index at runtime. This approach can be
far more efficient. Many cases that previously required thousands of
index descents now require as few as one single index descent. And, all
SAOP scans reliably avoid duplicative leaf page accesses (just like any
other nbtree index scan).
The array state machine now advances using binary searches for the array
element that best matches the next tuple's attribute value. This whole
process makes required scan key arrays (i.e. arrays from scan keys that
can terminate the scan) ratchet forward in lockstep with the index scan.
Non-required arrays (i.e. arrays from scan keys that can only exclude
non-matching tuples) are for the most part advanced via this same search
process. We just can't assume a fixed relationship between the current
element of any non-required array and the progress of the index scan
through the index's key space (that would be wrong).
Naturally, only required SAOP scan keys trigger skipping over leaf pages
(non-required arrays cannot safely end or start primitive index scans).
Consequently, index scans of a composite index with (say) a high-order
inequality scan key (which we'll mark required) and a low-order SAOP
scan key (which we'll mark non-required) will now reliably output rows
in index order. Such scans are always executed as one large index scan
under the hood, which is obviously the most efficient way to do it, for
the usual reason (no more wasting cycles on repeat leaf page accesses).
Generalizing SAOP execution along these lines removes any question of
index scans outputting tuples in any order that isn't the index's order.
This allow us to remove various special cases from the planner -- which
in turn makes the nbtree work more widely applicable and more effective.
Bugfix commit 807a40c5 taught the planner to avoid generating unsafe
path keys: path keys on a multicolumn index path, with a SAOP clause on
any attribute beyond the first/most significant attribute. These cases
are now all safe, so we go back to generating path keys without regard
for the presence of SAOP clauses (just like with any other clause type).
Also undo changes from follow-up bugfix commit a4523c5a, which taught
the planner to produce alternative index paths without any low-order
ScalarArrayOpExpr quals (making the SAOP quals into filter quals).
We'll no longer generate these alternative paths, which can no longer
offer any advantage over the index qual paths that we do still generate.
Affected queries thereby avoid all of the disadvantages that come from
using filter quals within index scan nodes. In particular, they can
avoid the extra heap page accesses previously incurred when using filter
quals to exclude non-matching tuples (index quals can be used instead).
This shift is expected to be fairly common in real world applications,
especially with queries that have multiple SAOPs that can now all be
used as index quals when scanning a composite index. Queries with
low-order SAOPs (especially non-required ones) are also likely to see a
significant reduction in heap page accesses.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
---
src/include/access/nbtree.h | 42 +-
src/backend/access/nbtree/nbtree.c | 63 +-
src/backend/access/nbtree/nbtsearch.c | 84 +-
src/backend/access/nbtree/nbtutils.c | 1727 +++++++++++++++++++-
src/backend/optimizer/path/indxpath.c | 86 +-
src/backend/utils/adt/selfuncs.c | 122 +-
doc/src/sgml/monitoring.sgml | 13 +
src/test/regress/expected/create_index.out | 61 +-
src/test/regress/expected/join.out | 5 +-
src/test/regress/sql/create_index.sql | 20 +-
10 files changed, 1932 insertions(+), 291 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7bfbf3086..566e1c15d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -965,7 +965,7 @@ typedef struct BTScanPosData
* moreLeft and moreRight track whether we think there may be matching
* index entries to the left and right of the current page, respectively.
* We can clear the appropriate one of these flags when _bt_checkkeys()
- * returns continuescan = false.
+ * sets BTReadPageState.continuescan = false.
*/
bool moreLeft;
bool moreRight;
@@ -1043,13 +1043,13 @@ typedef struct BTScanOpaqueData
/* workspace for SK_SEARCHARRAY support */
ScanKey arrayKeyData; /* modified copy of scan->keyData */
- bool arraysStarted; /* Started array keys, but have yet to "reach
- * past the end" of all arrays? */
int numArrayKeys; /* number of equality-type array keys (-1 if
* there are any unsatisfiable array keys) */
- int arrayKeyCount; /* count indicating number of array scan keys
- * processed */
+ bool needPrimScan; /* Perform another primitive scan? */
BTArrayKeyInfo *arrayKeys; /* info about each equality-type array key */
+ FmgrInfo *orderProcs; /* ORDER procs for equality constraint keys */
+ int numPrimScans; /* Running tally of # primitive index scans
+ * (used to coordinate parallel workers) */
MemoryContext arrayContext; /* scan-lifespan context for array data */
/* info about killed items if any (killedItems is NULL if never used) */
@@ -1083,6 +1083,29 @@ typedef struct BTScanOpaqueData
typedef BTScanOpaqueData *BTScanOpaque;
+/*
+ * _bt_readpage state used across _bt_checkkeys calls for a page
+ *
+ * When _bt_readpage is called during a forward scan that has one or more
+ * equality-type SK_SEARCHARRAY scan keys, it has an extra responsibility: to
+ * set up information about the final tuple from the page. This must happen
+ * before the first call to _bt_checkkeys. _bt_checkkeys uses the final tuple
+ * to manage advancement of the scan's array keys more efficiently.
+ */
+typedef struct BTReadPageState
+{
+ /* Input parameters, set by _bt_readpage */
+ ScanDirection dir; /* current scan direction */
+ IndexTuple finaltup; /* final tuple (high key for forward scans) */
+
+ /* Output parameters, set by _bt_checkkeys */
+ bool continuescan; /* Terminate ongoing (primitive) index scan? */
+
+ /* Private _bt_checkkeys-managed state */
+ bool finaltupchecked; /* final tuple checked against current
+ * SK_SEARCHARRAY array keys? */
+} BTReadPageState;
+
/*
* We use some private sk_flags bits in preprocessed scan keys. We're allowed
* to use bits 16-31 (see skey.h). The uppermost bits are copied from the
@@ -1090,6 +1113,7 @@ typedef BTScanOpaqueData *BTScanOpaque;
*/
#define SK_BT_REQFWD 0x00010000 /* required to continue forward scan */
#define SK_BT_REQBKWD 0x00020000 /* required to continue backward scan */
+#define SK_BT_RDDNARRAY 0x00040000 /* redundant in array preprocessing */
#define SK_BT_INDOPTION_SHIFT 24 /* must clear the above bits */
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
@@ -1160,7 +1184,7 @@ extern bool btcanreturn(Relation index, int attno);
extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno);
extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
extern void _bt_parallel_done(IndexScanDesc scan);
-extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
+extern void _bt_parallel_next_primitive_scan(IndexScanDesc scan);
/*
* prototypes for functions in nbtdedup.c
@@ -1253,12 +1277,12 @@ extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan,
+extern bool _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool finaltup,
bool requiredMatchedByPrecheck);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index a88b36a58..6328a8a63 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -48,8 +48,8 @@
* BTPARALLEL_IDLE indicates that no backend is currently advancing the scan
* to a new page; some process can start doing that.
*
- * BTPARALLEL_DONE indicates that the scan is complete (including error exit).
- * We reach this state once for every distinct combination of array keys.
+ * BTPARALLEL_DONE indicates that the primitive index scan is complete
+ * (including error exit). Reached once per primitive index scan.
*/
typedef enum
{
@@ -69,8 +69,8 @@ typedef struct BTParallelScanDescData
BTPS_State btps_pageStatus; /* indicates whether next page is
* available for scan. see above for
* possible states of parallel scan. */
- int btps_arrayKeyCount; /* count indicating number of array scan
- * keys processed by parallel scan */
+ int btps_numPrimScans; /* count indicating number of primitive
+ * index scans (used with array keys) */
slock_t btps_mutex; /* protects above variables */
ConditionVariable btps_cv; /* used to synchronize parallel scan */
} BTParallelScanDescData;
@@ -275,8 +275,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
/* If we have a tuple, return it ... */
if (res)
break;
- /* ... otherwise see if we have more array keys to deal with */
- } while (so->numArrayKeys && _bt_advance_array_keys(scan, dir));
+ /* ... otherwise see if we need another primitive index scan */
+ } while (so->numArrayKeys && _bt_array_keys_remain(scan, dir));
return res;
}
@@ -333,8 +333,8 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
ntids++;
}
}
- /* Now see if we have more array keys to deal with */
- } while (so->numArrayKeys && _bt_advance_array_keys(scan, ForwardScanDirection));
+ /* Now see if we need another primitive index scan */
+ } while (so->numArrayKeys && _bt_array_keys_remain(scan, ForwardScanDirection));
return ntids;
}
@@ -364,9 +364,10 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->keyData = NULL;
so->arrayKeyData = NULL; /* assume no array keys for now */
- so->arraysStarted = false;
so->numArrayKeys = 0;
+ so->needPrimScan = false;
so->arrayKeys = NULL;
+ so->orderProcs = NULL;
so->arrayContext = NULL;
so->killedItems = NULL; /* until needed */
@@ -406,7 +407,8 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
}
so->markItemIndex = -1;
- so->arrayKeyCount = 0;
+ so->needPrimScan = false;
+ so->numPrimScans = 0;
so->firstPage = false;
BTScanPosUnpinIfPinned(so->markPos);
BTScanPosInvalidate(so->markPos);
@@ -588,7 +590,7 @@ btinitparallelscan(void *target)
SpinLockInit(&bt_target->btps_mutex);
bt_target->btps_scanPage = InvalidBlockNumber;
bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- bt_target->btps_arrayKeyCount = 0;
+ bt_target->btps_numPrimScans = 0;
ConditionVariableInit(&bt_target->btps_cv);
}
@@ -614,7 +616,7 @@ btparallelrescan(IndexScanDesc scan)
SpinLockAcquire(&btscan->btps_mutex);
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- btscan->btps_arrayKeyCount = 0;
+ btscan->btps_numPrimScans = 0;
SpinLockRelease(&btscan->btps_mutex);
}
@@ -625,7 +627,11 @@ btparallelrescan(IndexScanDesc scan)
*
* The return value is true if we successfully seized the scan and false
* if we did not. The latter case occurs if no pages remain for the current
- * set of scankeys.
+ * primitive index scan.
+ *
+ * When array scan keys are in use, each worker process independently advances
+ * its array keys. It's crucial that each worker process never be allowed to
+ * scan a page from before the current scan position.
*
* If the return value is true, *pageno returns the next or current page
* of the scan (depending on the scan direction). An invalid block number
@@ -656,16 +662,17 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
SpinLockAcquire(&btscan->btps_mutex);
pageStatus = btscan->btps_pageStatus;
- if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+ if (so->numPrimScans < btscan->btps_numPrimScans)
{
- /* Parallel scan has already advanced to a new set of scankeys. */
+ /* Top-level scan already moved on to next primitive index scan */
status = false;
}
else if (pageStatus == BTPARALLEL_DONE)
{
/*
- * We're done with this set of scankeys. This may be the end, or
- * there could be more sets to try.
+ * We're done with this primitive index scan. This might have
+ * been the final primitive index scan required, or the top-level
+ * index scan might require additional primitive scans.
*/
status = false;
}
@@ -697,9 +704,12 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
void
_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
{
+ BTScanOpaque so PG_USED_FOR_ASSERTS_ONLY = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
+ Assert(!so->needPrimScan);
+
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
@@ -733,12 +743,11 @@ _bt_parallel_done(IndexScanDesc scan)
parallel_scan->ps_offset);
/*
- * Mark the parallel scan as done for this combination of scan keys,
- * unless some other process already did so. See also
- * _bt_advance_array_keys.
+ * Mark the primitive index scan as done, unless some other process
+ * already did so. See also _bt_array_keys_remain.
*/
SpinLockAcquire(&btscan->btps_mutex);
- if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
+ if (so->numPrimScans >= btscan->btps_numPrimScans &&
btscan->btps_pageStatus != BTPARALLEL_DONE)
{
btscan->btps_pageStatus = BTPARALLEL_DONE;
@@ -752,14 +761,14 @@ _bt_parallel_done(IndexScanDesc scan)
}
/*
- * _bt_parallel_advance_array_keys() -- Advances the parallel scan for array
- * keys.
+ * _bt_parallel_next_primitive_scan() -- Advances parallel primitive scan
+ * counter when array keys are in use.
*
- * Updates the count of array keys processed for both local and parallel
+ * Updates the count of primitive index scans for both local and parallel
* scans.
*/
void
-_bt_parallel_advance_array_keys(IndexScanDesc scan)
+_bt_parallel_next_primitive_scan(IndexScanDesc scan)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
@@ -768,13 +777,13 @@ _bt_parallel_advance_array_keys(IndexScanDesc scan)
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
- so->arrayKeyCount++;
+ so->numPrimScans++;
SpinLockAcquire(&btscan->btps_mutex);
if (btscan->btps_pageStatus == BTPARALLEL_DONE)
{
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
- btscan->btps_arrayKeyCount++;
+ btscan->btps_numPrimScans++;
}
SpinLockRelease(&btscan->btps_mutex);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index efc5284e5..834012514 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -893,7 +893,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
if (!so->qual_ok)
{
- /* Notify any other workers that we're done with this scan key. */
+ /* Notify any other workers that this primitive scan is done */
_bt_parallel_done(scan);
return false;
}
@@ -1537,9 +1537,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
BTPageOpaque opaque;
OffsetNumber minoff;
OffsetNumber maxoff;
- int itemIndex;
- bool continuescan;
- int indnatts;
+ BTReadPageState pstate;
+ int numArrayKeys,
+ itemIndex;
bool requiredMatchedByPrecheck;
/*
@@ -1560,8 +1560,12 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
}
- continuescan = true; /* default assumption */
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ pstate.dir = dir;
+ pstate.finaltup = NULL;
+ pstate.continuescan = true; /* default assumption */
+ pstate.finaltupchecked = false;
+ numArrayKeys = so->numArrayKeys;
+
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
@@ -1609,9 +1613,12 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* the last item on the page would give a more precise answer.
*
* We skip this for the first page in the scan to evade the possible
- * slowdown of the point queries.
+ * slowdown of point queries. Never apply the optimization with a scans
+ * that uses array keys, either, since that breaks certain assumptions.
+ * (Our search-type scan keys change whenever _bt_checkkeys advances the
+ * arrays, invalidating any precheck. Tracking all that would be tricky.)
*/
- if (!so->firstPage && minoff < maxoff)
+ if (!so->firstPage && !numArrayKeys && minoff < maxoff)
{
ItemId iid;
IndexTuple itup;
@@ -1625,8 +1632,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* set flag to true if all required keys are satisfied and false
* otherwise.
*/
- (void) _bt_checkkeys(scan, itup, indnatts, dir,
- &requiredMatchedByPrecheck, false);
+ _bt_checkkeys(scan, &pstate, itup, false, false);
+ requiredMatchedByPrecheck = pstate.continuescan;
+ pstate.continuescan = true; /* reset */
}
else
{
@@ -1636,6 +1644,14 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (ScanDirectionIsForward(dir))
{
+ /* SK_SEARCHARRAY forward scans must provide high key up front */
+ if (numArrayKeys && !P_RIGHTMOST(opaque))
+ {
+ ItemId iid = PageGetItemId(page, P_HIKEY);
+
+ pstate.finaltup = (IndexTuple) PageGetItem(page, iid);
+ }
+
/* load items[] in ascending order */
itemIndex = 0;
@@ -1659,8 +1675,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, iid);
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan, requiredMatchedByPrecheck);
+ passes_quals = _bt_checkkeys(scan, &pstate, itup, false,
+ requiredMatchedByPrecheck);
/*
* If the result of prechecking required keys was true, then in
@@ -1668,8 +1684,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* result is the same.
*/
Assert(!requiredMatchedByPrecheck ||
- passes_quals == _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan, false));
+ passes_quals == _bt_checkkeys(scan, &pstate, itup, false,
+ false));
if (passes_quals)
{
/* tuple passes all scan key conditions */
@@ -1703,7 +1719,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
}
}
/* When !continuescan, there can't be any more matches, so stop */
- if (!continuescan)
+ if (!pstate.continuescan)
break;
offnum = OffsetNumberNext(offnum);
@@ -1720,17 +1736,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* only appear on non-pivot tuples on the right sibling page are
* common.
*/
- if (continuescan && !P_RIGHTMOST(opaque))
+ if (pstate.continuescan && !P_RIGHTMOST(opaque))
{
ItemId iid = PageGetItemId(page, P_HIKEY);
- IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
- int truncatt;
+ IndexTuple itup;
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan, false);
+ itup = (IndexTuple) PageGetItem(page, iid);
+ _bt_checkkeys(scan, &pstate, itup, true, false);
}
- if (!continuescan)
+ if (!pstate.continuescan)
so->currPos.moreRight = false;
Assert(itemIndex <= MaxTIDsPerBTreePage);
@@ -1740,6 +1755,14 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
}
else
{
+ /* SK_SEARCHARRAY backward scans must provide final tuple up front */
+ if (numArrayKeys && minoff <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, minoff);
+
+ pstate.finaltup = (IndexTuple) PageGetItem(page, iid);
+ }
+
/* load items[] in descending order */
itemIndex = MaxTIDsPerBTreePage;
@@ -1751,6 +1774,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
IndexTuple itup;
bool tuple_alive;
bool passes_quals;
+ bool finaltup = (offnum == minoff);
/*
* If the scan specifies not to return killed tuples, then we
@@ -1761,12 +1785,18 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* tuple on the page, we do check the index keys, to prevent
* uselessly advancing to the page to the left. This is similar
* to the high key optimization used by forward scans.
+ *
+ * Separately, _bt_checkkeys actually requires that we call it
+ * with the final non-pivot tuple from the page, if there's one
+ * (final processed tuple, or first tuple in offset number terms).
+ * We must indicate which particular tuple comes last, too.
*/
if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
{
Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
+ if (!finaltup)
{
+ Assert(offnum > minoff);
offnum = OffsetNumberPrev(offnum);
continue;
}
@@ -1778,8 +1808,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, iid);
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan, requiredMatchedByPrecheck);
+ passes_quals = _bt_checkkeys(scan, &pstate, itup, finaltup,
+ requiredMatchedByPrecheck);
/*
* If the result of prechecking required keys was true, then in
@@ -1787,8 +1817,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* result is the same.
*/
Assert(!requiredMatchedByPrecheck ||
- passes_quals == _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan, false));
+ passes_quals == _bt_checkkeys(scan, &pstate, itup,
+ finaltup, false));
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions */
@@ -1827,7 +1857,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
}
}
}
- if (!continuescan)
+ if (!pstate.continuescan)
{
/* there can't be any more matches, so stop */
so->currPos.moreLeft = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1510b97fb..4d8e33a4d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -33,7 +33,7 @@
typedef struct BTSortArrayContext
{
- FmgrInfo flinfo;
+ FmgrInfo *orderproc;
Oid collation;
bool reverse;
} BTSortArrayContext;
@@ -41,15 +41,41 @@ typedef struct BTSortArrayContext
static Datum _bt_find_extreme_element(IndexScanDesc scan, ScanKey skey,
StrategyNumber strat,
Datum *elems, int nelems);
+static void _bt_sort_array_cmp_setup(IndexScanDesc scan, ScanKey skey);
static int _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
bool reverse,
Datum *elems, int nelems);
+static int _bt_merge_arrays(IndexScanDesc scan, ScanKey skey, bool reverse,
+ Datum *elems_orig, int nelems_orig,
+ Datum *elems_next, int nelems_next);
static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
+static inline int32 _bt_compare_array_skey(FmgrInfo *orderproc,
+ Datum tupdatum, bool tupnull,
+ Datum arrdatum, ScanKey cur);
+static int _bt_binsrch_array_skey(FmgrInfo *orderproc,
+ bool cur_elem_start, ScanDirection dir,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *final_result);
+static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_tuple_before_array_skeys(IndexScanDesc scan,
+ BTReadPageState *pstate,
+ IndexTuple tuple, int sktrig);
+static bool _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, int sktrig);
+static void _bt_update_keys_with_arraykeys(IndexScanDesc scan);
+#ifdef USE_ASSERT_CHECKING
+static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan);
+#endif
static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
ScanKey leftarg, ScanKey rightarg,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
static void _bt_mark_scankey_required(ScanKey skey);
+static bool _bt_check_compare(ScanDirection dir, BTScanOpaque so,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ bool *continuescan, int *sktrig,
+ bool requiredMatchedByPrecheck);
static bool _bt_check_rowcompare(ScanKey skey,
IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
ScanDirection dir, bool *continuescan);
@@ -198,13 +224,48 @@ _bt_freestack(BTStack stack)
* If there are any SK_SEARCHARRAY scan keys, deconstruct the array(s) and
* set up BTArrayKeyInfo info for each one that is an equality-type key.
* Prepare modified scan keys in so->arrayKeyData, which will hold the current
- * array elements during each primitive indexscan operation. For inequality
- * array keys, it's sufficient to find the extreme element value and replace
- * the whole array with that scalar value.
+ * array elements.
+ *
+ * _bt_preprocess_keys treats each primitive scan as an independent piece of
+ * work. That structure pushes the responsibility for preprocessing that must
+ * work "across array keys" onto us. This division of labor makes sense once
+ * you consider that we're typically called no more than once per btrescan,
+ * whereas _bt_preprocess_keys is always called once per primitive index scan.
+ *
+ * Currently we perform two kinds of preprocessing to deal with redundancies.
+ * For inequality array keys, it's sufficient to find the extreme element
+ * value and replace the whole array with that scalar value. This eliminates
+ * all but one array key as redundant. Similarly, we are capable of "merging
+ * together" multiple equality array keys from two or more input scan keys
+ * into a single output scan key that contains only the intersecting array
+ * elements. This can eliminate many redundant array elements, as well as
+ * eliminating whole array scan keys as redundant.
+ *
+ * Note: _bt_start_array_keys actually sets up the cur_elem counters later on,
+ * once the scan direction is known.
*
* Note: the reason we need so->arrayKeyData, rather than just scribbling
* on scan->keyData, is that callers are permitted to call btrescan without
* supplying a new set of scankey data.
+ *
+ * Note: _bt_preprocess_keys is responsible for creating the so->keyData scan
+ * keys used by _bt_checkkeys. Index scans that don't use equality array keys
+ * will have _bt_preprocess_keys treat scan->keyData as input and so->keyData
+ * as output. Scans that use equality array keys have _bt_preprocess_keys
+ * treat so->arrayKeyData (which is our output) as their input, while (as per
+ * usual) outputting so->keyData for _bt_checkkeys. This function adds an
+ * additional layer of indirection that allows _bt_preprocess_keys to more or
+ * less avoid dealing with SK_SEARCHARRAY as a special case.
+ *
+ * Note: _bt_update_keys_with_arraykeys works by updating already-processed
+ * output keys (so->keyData) in-place. It cannot eliminate redundant or
+ * contradictory scan keys. This necessitates having _bt_preprocess_keys
+ * understand that it is unsafe to eliminate "redundant" SK_SEARCHARRAY
+ * equality scan keys on the basis of what is actually just the current array
+ * key values -- it must conservatively assume that such a scan key might no
+ * longer be redundant after the next _bt_update_keys_with_arraykeys call.
+ * Ideally we'd be able to deal with that by eliminating a subset of truly
+ * redundant array keys up-front, but it doesn't seem worth the trouble.
*/
void
_bt_preprocess_array_keys(IndexScanDesc scan)
@@ -212,7 +273,9 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
int numberOfKeys = scan->numberOfKeys;
int16 *indoption = scan->indexRelation->rd_indoption;
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(scan->indexRelation);
int numArrayKeys;
+ int lastEqualityArrayAtt = -1;
ScanKey cur;
int i;
MemoryContext oldContext;
@@ -265,6 +328,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
/* Allocate space for per-array data in the workspace context */
so->arrayKeys = (BTArrayKeyInfo *) palloc0(numArrayKeys * sizeof(BTArrayKeyInfo));
+ so->orderProcs = (FmgrInfo *) palloc0(nkeyatts * sizeof(FmgrInfo));
/* Now process each array key */
numArrayKeys = 0;
@@ -281,6 +345,16 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
int j;
cur = &so->arrayKeyData[i];
+
+ /*
+ * Attributes with equality-type scan keys (including but not limited
+ * to array scan keys) will need a 3-way comparison function. Set
+ * that up now. (Avoids repeating work for the same attribute.)
+ */
+ if (cur->sk_strategy == BTEqualStrategyNumber &&
+ !OidIsValid(so->orderProcs[cur->sk_attno - 1].fn_oid))
+ _bt_sort_array_cmp_setup(scan, cur);
+
if (!(cur->sk_flags & SK_SEARCHARRAY))
continue;
@@ -357,6 +431,47 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
(indoption[cur->sk_attno - 1] & INDOPTION_DESC) != 0,
elem_values, num_nonnulls);
+ /*
+ * If this scan key is semantically equivalent to a previous equality
+ * operator array scan key, merge the two arrays together to eliminate
+ * redundant non-intersecting elements (and redundant whole scan keys)
+ */
+ if (lastEqualityArrayAtt == cur->sk_attno)
+ {
+ BTArrayKeyInfo *prev = &so->arrayKeys[numArrayKeys - 1];
+
+ Assert(so->arrayKeyData[prev->scan_key].sk_func.fn_oid ==
+ cur->sk_func.fn_oid);
+ Assert(so->arrayKeyData[prev->scan_key].sk_subtype ==
+ cur->sk_subtype);
+
+ num_elems = _bt_merge_arrays(scan, cur,
+ (indoption[cur->sk_attno - 1] & INDOPTION_DESC) != 0,
+ prev->elem_values, prev->num_elems,
+ elem_values, num_elems);
+
+ pfree(elem_values);
+
+ /*
+ * If there are no intersecting elements left from merging this
+ * array into the previous array on the same attribute, the scan
+ * qual is unsatisfiable
+ */
+ if (num_elems == 0)
+ {
+ numArrayKeys = -1;
+ break;
+ }
+
+ /*
+ * Lower the number of elements from the previous array, and mark
+ * this scan key/array as redundant for every primitive index scan
+ */
+ prev->num_elems = num_elems;
+ cur->sk_flags |= SK_BT_RDDNARRAY;
+ continue;
+ }
+
/*
* And set up the BTArrayKeyInfo data.
*/
@@ -364,6 +479,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
so->arrayKeys[numArrayKeys].num_elems = num_elems;
so->arrayKeys[numArrayKeys].elem_values = elem_values;
numArrayKeys++;
+ lastEqualityArrayAtt = cur->sk_attno;
}
so->numArrayKeys = numArrayKeys;
@@ -437,26 +553,28 @@ _bt_find_extreme_element(IndexScanDesc scan, ScanKey skey,
}
/*
- * _bt_sort_array_elements() -- sort and de-dup array elements
+ * _bt_sort_array_cmp_setup() -- Look up array comparison function
*
- * The array elements are sorted in-place, and the new number of elements
- * after duplicate removal is returned.
- *
- * scan and skey identify the index column, whose opfamily determines the
- * comparison semantics. If reverse is true, we sort in descending order.
+ * Sets so->orderProcs[] for scan key's attribute. This is used to sort and
+ * deduplicate the attribute's array (if any). It's also used during binary
+ * searches of the next array key matching index tuples just beyond the range
+ * of the scan's current set of array keys.
*/
-static int
-_bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
- bool reverse,
- Datum *elems, int nelems)
+static void
+_bt_sort_array_cmp_setup(IndexScanDesc scan, ScanKey skey)
{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
Oid elemtype;
RegProcedure cmp_proc;
- BTSortArrayContext cxt;
+ FmgrInfo *orderproc = &so->orderProcs[skey->sk_attno - 1];
- if (nelems <= 1)
- return nelems; /* no work to do */
+ /*
+ * Should do this for all equality strategy scan keys only (including
+ * those without any array). See _bt_advance_array_keys for details of
+ * why we need an ORDER proc for non-array equality strategy scan keys.
+ */
+ Assert(skey->sk_strategy == BTEqualStrategyNumber);
/*
* Determine the nominal datatype of the array elements. We have to
@@ -471,12 +589,10 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
* Look up the appropriate comparison function in the opfamily.
*
* Note: it's possible that this would fail, if the opfamily is
- * incomplete, but it seems quite unlikely that an opfamily would omit
- * non-cross-type support functions for any datatype that it supports at
- * all.
+ * incomplete.
*/
cmp_proc = get_opfamily_proc(rel->rd_opfamily[skey->sk_attno - 1],
- elemtype,
+ rel->rd_opcintype[skey->sk_attno - 1],
elemtype,
BTORDER_PROC);
if (!RegProcedureIsValid(cmp_proc))
@@ -484,8 +600,32 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
BTORDER_PROC, elemtype, elemtype,
rel->rd_opfamily[skey->sk_attno - 1]);
+ /* Save in orderproc entry for attribute */
+ fmgr_info_cxt(cmp_proc, orderproc, so->arrayContext);
+}
+
+/*
+ * _bt_sort_array_elements() -- sort and de-dup array elements
+ *
+ * The array elements are sorted in-place, and the new number of elements
+ * after duplicate removal is returned.
+ *
+ * scan and skey identify the index column, whose opfamily determines the
+ * comparison semantics. If reverse is true, we sort in descending order.
+ */
+static int
+_bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
+ bool reverse,
+ Datum *elems, int nelems)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSortArrayContext cxt;
+
+ if (nelems <= 1)
+ return nelems; /* no work to do */
+
/* Sort the array elements */
- fmgr_info(cmp_proc, &cxt.flinfo);
+ cxt.orderproc = &so->orderProcs[skey->sk_attno - 1];
cxt.collation = skey->sk_collation;
cxt.reverse = reverse;
qsort_arg(elems, nelems, sizeof(Datum),
@@ -496,6 +636,48 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
_bt_compare_array_elements, &cxt);
}
+/*
+ * _bt_merge_arrays() -- merge together duplicate array keys
+ *
+ * Both scan keys have array elements that have already been sorted and
+ * deduplicated.
+ */
+static int
+_bt_merge_arrays(IndexScanDesc scan, ScanKey skey, bool reverse,
+ Datum *elems_orig, int nelems_orig,
+ Datum *elems_next, int nelems_next)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSortArrayContext cxt;
+ Datum *merged = palloc(sizeof(Datum) * Min(nelems_orig, nelems_next));
+ int merged_nelems = 0;
+
+ /*
+ * Incrementally copy the original array into a temp buffer, skipping over
+ * any items that are missing from the "next" array
+ */
+ cxt.orderproc = &so->orderProcs[skey->sk_attno - 1];
+ cxt.collation = skey->sk_collation;
+ cxt.reverse = reverse;
+ for (int i = 0; i < nelems_orig; i++)
+ {
+ Datum *elem = elems_orig + i;
+
+ if (bsearch_arg(elem, elems_next, nelems_next, sizeof(Datum),
+ _bt_compare_array_elements, &cxt))
+ merged[merged_nelems++] = *elem;
+ }
+
+ /*
+ * Overwrite the original array with temp buffer so that we're only left
+ * with intersecting array elements
+ */
+ memcpy(elems_orig, merged, merged_nelems * sizeof(Datum));
+ pfree(merged);
+
+ return merged_nelems;
+}
+
/*
* qsort_arg comparator for sorting array elements
*/
@@ -507,7 +689,7 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
BTSortArrayContext *cxt = (BTSortArrayContext *) arg;
int32 compare;
- compare = DatumGetInt32(FunctionCall2Coll(&cxt->flinfo,
+ compare = DatumGetInt32(FunctionCall2Coll(cxt->orderproc,
cxt->collation,
da, db));
if (cxt->reverse)
@@ -515,6 +697,158 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
return compare;
}
+/*
+ * _bt_compare_array_skey() -- apply array comparison function
+ *
+ * Compares caller's tuple attribute value to a scan key/array element.
+ * Helper function used during binary searches of SK_SEARCHARRAY arrays.
+ *
+ * This routine returns:
+ * <0 if tupdatum < arrdatum;
+ * 0 if tupdatum == arrdatum;
+ * >0 if tupdatum > arrdatum.
+ *
+ * This is essentially the same interface as _bt_compare: both functions
+ * compare the value that they're searching for to a binary search pivot.
+ * However, unlike _bt_compare, this function's "tuple argument" comes first,
+ * while its "array/scankey argument" comes second.
+*/
+static inline int32
+_bt_compare_array_skey(FmgrInfo *orderproc,
+ Datum tupdatum, bool tupnull,
+ Datum arrdatum, ScanKey cur)
+{
+ int32 result = 0;
+
+ Assert(cur->sk_strategy == BTEqualStrategyNumber);
+
+ if (tupnull) /* NULL tupdatum */
+ {
+ if (cur->sk_flags & SK_ISNULL)
+ result = 0; /* NULL "=" NULL */
+ else if (cur->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (cur->sk_flags & SK_ISNULL) /* NOT_NULL tupdatum, NULL arrdatum */
+ {
+ if (cur->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * Like _bt_compare, we need to be careful of cross-type comparisons,
+ * so the left value has to be the value that came from an index tuple
+ */
+ result = DatumGetInt32(FunctionCall2Coll(orderproc, cur->sk_collation,
+ tupdatum, arrdatum));
+
+ /*
+ * We flip the sign by following the obvious rule: flip whenever the
+ * column is a DESC column.
+ *
+ * _bt_compare does it the wrong way around (flip when *ASC*) in order
+ * to compensate for passing its orderproc arguments backwards. We
+ * don't need to play these games because we find it natural to pass
+ * tupdatum as the left value (and arrdatum as the right value).
+ */
+ if (cur->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ return result;
+}
+
+/*
+ * _bt_binsrch_array_skey() -- Binary search for next matching array key
+ *
+ * Returns an index to the first array element >= caller's tupdatum argument.
+ * This convention is more natural for forwards scan callers, but that can't
+ * really matter to backwards scan callers. Both callers require handling for
+ * the case where the match we return is < tupdatum, and symmetric handling
+ * for the case where our best match is > tupdatum.
+ *
+ * Also sets *final_result to whatever _bt_compare_array_skey returned when we
+ * compared the returned array element to caller's tupdatum argument. This
+ * helps caller to decide what to do next. Caller should only accept the
+ * element we locate as-is when it's an exact match (i.e. *final_result is 0).
+ *
+ * cur_elem_start indicates if the binary search should begin at the array's
+ * current element (or have the current element as an upper bound if it's a
+ * backward scan). This (and information about the scan's direction) allows
+ * searches against required scan key arrays to reuse earlier search bounds.
+ */
+static int
+_bt_binsrch_array_skey(FmgrInfo *orderproc,
+ bool cur_elem_start, ScanDirection dir,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *final_result)
+{
+ int low_elem,
+ mid_elem,
+ high_elem,
+ result = 0;
+
+ Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(cur->sk_strategy == BTEqualStrategyNumber);
+
+ low_elem = 0;
+ mid_elem = -1;
+ high_elem = array->num_elems - 1;
+ if (cur_elem_start)
+ {
+ if (ScanDirectionIsForward(dir))
+ low_elem = array->cur_elem;
+ else
+ high_elem = array->cur_elem;
+ }
+
+ while (high_elem > low_elem)
+ {
+ Datum arrdatum;
+
+ mid_elem = low_elem + ((high_elem - low_elem) / 2);
+ arrdatum = array->elem_values[mid_elem];
+
+ result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
+ arrdatum, cur);
+
+ if (result == 0)
+ {
+ /*
+ * Each array was deduplicated during initial preprocessing, so
+ * it's safe to quit as soon as we see an equal array element.
+ * This often saves an extra comparison or two...
+ */
+ low_elem = mid_elem;
+ break;
+ }
+
+ if (result > 0)
+ low_elem = mid_elem + 1;
+ else
+ high_elem = mid_elem;
+ }
+
+ /*
+ * ...but our caller also cares about how its searched-for tuple datum
+ * compares to the array element we'll return. We must set *final_result
+ * with the result of that comparison specifically.
+ */
+ if (low_elem != mid_elem)
+ result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
+ array->elem_values[low_elem], cur);
+
+ *final_result = result;
+
+ return low_elem;
+}
+
/*
* _bt_start_array_keys() -- Initialize array keys at start of a scan
*
@@ -539,30 +873,35 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
curArrayKey->cur_elem = 0;
skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
}
-
- so->arraysStarted = true;
}
/*
- * _bt_advance_array_keys() -- Advance to next set of array elements
+ * _bt_advance_array_keys_increment() -- Advance to next set of array elements
+ *
+ * Advances the array keys by a single increment in the current scan
+ * direction. When there are multiple array keys this can roll over from the
+ * lowest order array to higher order arrays.
*
* Returns true if there is another set of values to consider, false if not.
* On true result, the scankeys are initialized with the next set of values.
+ * On false result, the scankeys stay the same, and the array keys are not
+ * advanced (every array is still at its final element for scan direction).
*/
-bool
-_bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
+static bool
+_bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool found = false;
- int i;
+
+ Assert(!so->needPrimScan);
/*
* We must advance the last array key most quickly, since it will
* correspond to the lowest-order index column among the available
- * qualifications. This is necessary to ensure correct ordering of output
- * when there are multiple array keys.
+ * qualifications. Rolling over like this is necessary to ensure correct
+ * ordering of output when there are multiple array keys.
*/
- for (i = so->numArrayKeys - 1; i >= 0; i--)
+ for (int i = so->numArrayKeys - 1; i >= 0; i--)
{
BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key];
@@ -596,19 +935,31 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
break;
}
- /* advance parallel scan */
- if (scan->parallel_scan != NULL)
- _bt_parallel_advance_array_keys(scan);
+ if (found)
+ return true;
/*
- * When no new array keys were found, the scan is "past the end" of the
- * array keys. _bt_start_array_keys can still "restart" the array keys if
- * a rescan is required.
+ * Don't allow the entire set of array keys to roll over: restore the
+ * array keys to the state they were in before we were called.
+ *
+ * This ensures that the array keys only ratchet forward (or backwards in
+ * the case of backward scans). Our "so->arrayKeyData[]" scan keys should
+ * always match the current "so->keyData[]" search-type scan keys (except
+ * for a brief moment during array key advancement).
*/
- if (!found)
- so->arraysStarted = false;
+ for (int i = 0; i < so->numArrayKeys; i++)
+ {
+ BTArrayKeyInfo *rollarray = &so->arrayKeys[i];
+ ScanKey skey = &so->arrayKeyData[rollarray->scan_key];
- return found;
+ if (ScanDirectionIsBackward(dir))
+ rollarray->cur_elem = 0;
+ else
+ rollarray->cur_elem = rollarray->num_elems - 1;
+ skey->sk_argument = rollarray->elem_values[rollarray->cur_elem];
+ }
+
+ return false;
}
/*
@@ -661,20 +1012,845 @@ _bt_restore_array_keys(IndexScanDesc scan)
* If we changed any keys, we must redo _bt_preprocess_keys. That might
* sound like overkill, but in cases with multiple keys per index column
* it seems necessary to do the full set of pushups.
- *
- * Also do this whenever the scan's set of array keys "wrapped around" at
- * the end of the last primitive index scan. There won't have been a call
- * to _bt_preprocess_keys from some other place following wrap around, so
- * we do it for ourselves.
*/
- if (changed || !so->arraysStarted)
- {
+ if (changed)
_bt_preprocess_keys(scan);
- /* The mark should have been set on a consistent set of keys... */
- Assert(so->qual_ok);
- }
+
+ Assert(_bt_verify_keys_with_arraykeys(scan));
}
+/*
+ * _bt_tuple_before_array_skeys() -- _bt_checkkeys array helper function
+ *
+ * Routine to determine if a continuescan=false tuple (set that way by an
+ * initial call to _bt_check_compare) must advance the scan's array keys.
+ * Only call here when _bt_check_compare already set continuescan=false.
+ *
+ * Returns true when caller passes a tuple that is < the current set of array
+ * keys for the most significant non-equal column/scan key (or > for backwards
+ * scans). This means that it cannot possibly be time to advance the array
+ * keys just yet. _bt_checkkeys caller should suppress its _bt_check_compare
+ * call, and return -- the tuple is treated as not satisfying our indexquals.
+ *
+ * Returns false when caller's tuple is >= the current array keys (or <=, in
+ * the case of backwards scans). This means that it is now time for our
+ * caller to advance the array keys (unless caller broke the rules by not
+ * checking with _bt_check_compare before calling here).
+ *
+ * Note: advancing the array keys may be required when every attribute value
+ * from caller's tuple is equal to corresponding scan key/array datums. See
+ * _bt_advance_array_keys and its handling of inequalities for details.
+ *
+ * Note: caller passes _bt_check_compare-set sktrig value to indicate which
+ * scan key triggered the call. If this is for any scan key that isn't a
+ * required equality strategy scan key, calling here is a no-op, meaning that
+ * we'll invariably return false. We just accept whatever _bt_check_compare
+ * indicated about the scan when it involves a required inequality scan key.
+ * We never care about nonrequired scan keys, including equality strategy
+ * array scan keys (though _bt_check_compare can temporarily end the scan to
+ * advance their ararys in _bt_advance_array_keys, which we'll never prevent).
+ */
+static bool
+_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, int sktrig)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ ScanDirection dir = pstate->dir;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ bool tuple_before_array_keys = false;
+ ScanKey cur;
+ int ntupatts = BTreeTupleGetNAtts(tuple, rel),
+ ikey;
+
+ Assert(so->numArrayKeys > 0);
+ Assert(so->numberOfKeys > 0);
+ Assert(!so->needPrimScan);
+
+ for (cur = so->keyData + sktrig, ikey = sktrig;
+ ikey < so->numberOfKeys;
+ cur++, ikey++)
+ {
+ int attnum = cur->sk_attno;
+ FmgrInfo *orderproc;
+ Datum tupdatum;
+ bool tupnull;
+ int32 result;
+
+ /*
+ * Unlike _bt_check_compare and _bt_advance_array_keys, we never deal
+ * with inequality strategy scan keys (even those marked required). We
+ * also don't deal with non-required equality keys -- even when they
+ * happen to have arrays that might need to be advanced.
+ *
+ * Note: cannot "break" here due to corner cases involving redundant
+ * scan keys that weren't eliminated within _bt_preprocess_keys.
+ */
+ if (cur->sk_strategy != BTEqualStrategyNumber ||
+ (cur->sk_flags & SK_BT_REQFWD) == 0)
+ continue;
+
+ /* Required equality scan keys always required in both directions */
+ Assert((cur->sk_flags & SK_BT_REQFWD) &&
+ (cur->sk_flags & SK_BT_REQBKWD));
+
+ if (attnum > ntupatts)
+ {
+ /*
+ * When we reach a high key's truncated attribute, assume that the
+ * tuple attribute's value is >= the scan's equality constraint
+ * scan keys, forcing another _bt_advance_array_keys call.
+ *
+ * You might wonder why we don't treat truncated attributes as
+ * having values < our equality constraints instead; we're not
+ * treating the truncated attributes as having -inf values here,
+ * which is how things are done in _bt_compare.
+ *
+ * We're often called during finaltup prechecks, where we help our
+ * caller to decide whether or not it should terminate the current
+ * primitive index scan. Our behavior here implements a policy of
+ * being slightly optimistic about what will be found on the next
+ * page when the current primitive scan continues onto that page.
+ * (This is also closest to what _bt_check_compare does.)
+ */
+ break;
+ }
+
+ tupdatum = index_getattr(tuple, attnum, itupdesc, &tupnull);
+
+ orderproc = &so->orderProcs[attnum - 1];
+ result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
+ cur->sk_argument, cur);
+
+ if (result != 0)
+ {
+ if (ScanDirectionIsForward(dir))
+ tuple_before_array_keys = result < 0;
+ else
+ tuple_before_array_keys = result > 0;
+
+ break;
+ }
+ }
+
+ return tuple_before_array_keys;
+}
+
+/*
+ * _bt_array_keys_remain() -- start scheduled primitive index scan?
+ *
+ * Returns true if _bt_checkkeys scheduled another primitive index scan, just
+ * as the last one ended. Otherwise returns false, indicating that the array
+ * keys are now fully exhausted.
+ *
+ * Only call here during scans with one or more equality type array scan keys.
+ */
+bool
+_bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ Assert(so->numArrayKeys);
+
+ /*
+ * Array keys are advanced within _bt_checkkeys when the scan reaches the
+ * leaf level (more precisely, they're advanced when the scan reaches the
+ * end of each distinct set of array elements). This process avoids
+ * repeat access to leaf pages (across multiple primitive index scans) by
+ * advancing the scan's array keys when it allows the primitive index scan
+ * to find nearby matching tuples (or when it eliminates ranges of array
+ * key space that can't possibly be satisfied by any index tuple).
+ *
+ * _bt_checkkeys sets a simple flag variable to schedule another primitive
+ * index scan. This tells us what to do. We cannot rely on _bt_first
+ * always reaching _bt_checkkeys, though. There are various cases where
+ * that won't happen. For example, if the index is completely empty, then
+ * _bt_first won't get as far as calling _bt_readpage/_bt_checkkeys.
+ *
+ * We also don't expect _bt_checkkeys to be reached when searching for a
+ * non-existent value that happens to be higher than any existing value in
+ * the index. No _bt_checkkeys are expected when _bt_readpage reads the
+ * rightmost page during such a scan -- even a _bt_checkkeys call against
+ * the high key won't happen. There is an analogous issue for backwards
+ * scans that search for a value lower than all existing index tuples.
+ *
+ * We don't actually require special handling for these cases -- we don't
+ * need to be explicitly instructed to _not_ perform another primitive
+ * index scan. This is correct for all of the cases we've listed so far,
+ * which all involve primitive index scans that access pages "near the
+ * boundaries of the key space" (the leftmost page, the rightmost page, or
+ * an imaginary empty leaf root page). If _bt_checkkeys cannot be reached
+ * by a primitive index scan for one set of array keys, it follows that it
+ * also won't be reached for any later set of array keys...
+ */
+ if (!so->qual_ok)
+ {
+ /*
+ * ...though there is one exception: _bt_first's _bt_preprocess_keys
+ * call can determine that the scan's input scan keys can never be
+ * satisfied. That might be true for one set of array keys, but not
+ * the next set.
+ *
+ * Handle this by advancing the array keys incrementally ourselves.
+ * When this succeeds, start another primitive index scan.
+ */
+ CHECK_FOR_INTERRUPTS();
+
+ Assert(!so->needPrimScan);
+ if (_bt_advance_array_keys_increment(scan, dir))
+ return true;
+
+ /* Array keys are now exhausted */
+ }
+
+ /*
+ * Has another primitive index scan been scheduled by _bt_checkkeys?
+ */
+ if (so->needPrimScan)
+ {
+ /* Yes -- tell caller to call _bt_first once again */
+ so->needPrimScan = false;
+ if (scan->parallel_scan != NULL)
+ _bt_parallel_next_primitive_scan(scan);
+
+ return true;
+ }
+
+ /*
+ * No more primitive index scans. Terminate the top-level scan.
+ */
+ if (scan->parallel_scan != NULL)
+ _bt_parallel_done(scan);
+
+ return false;
+}
+
+/*
+ * _bt_advance_array_keys() -- Advance array elements using a tuple
+ *
+ * Like _bt_check_compare, our return value indicates if tuple satisfied the
+ * qual (specifically our new qual). We also set pstate.continuescan=false
+ * for caller when the top-level index scan is over (when all required array
+ * keys are now exhausted). Otherwise, we'll set pstate.continuescan=true,
+ * indicating that top-level scan should proceed onto the next tuple. After
+ * we return, all further calls to _bt_check_compare will also use our new
+ * qual (a qual with newly advanced array key values, set here by us).
+ *
+ * _bt_tuple_before_array_skeys is responsible for determining if the current
+ * place in the scan is >= the current array keys. Calling here before that
+ * point will prematurely advance the array keys, leading to wrong query
+ * results. (Actually, the case where the top-level scan ends might not
+ * advance the array keys, since there may be no further keys in the current
+ * scan direction.)
+ *
+ * We're responsible for ensuring that caller's tuple is <= current/newly
+ * advanced required array keys once we return (this postcondition is also
+ * checked via another assertion). We try to find an exact match, but failing
+ * that we'll advance the array keys to whatever set of keys comes next in the
+ * key space (among the keys that we actually have). Required array keys only
+ * ever "ratchet forwards", progressing in lock step with the scan itself.
+ *
+ * (The invariants are the same for backwards scans, except that the operators
+ * are flipped: just replace the precondition's >= operator with a <=, and the
+ * postcondition's <= operator with with a >=. In other words, just swap the
+ * precondition with the postcondition.)
+ *
+ * Note that we deal with all required equality strategy scan keys here; it's
+ * not limited to array scan keys. They're equality constraints for our
+ * purposes, and so are handled as degenerate single element arrays here.
+ * Obviously, they can never really advance in the way that real arrays can,
+ * but they must still affect how we advance real array scan keys, just like
+ * any other equality constraint. We have to keep around a 3-way ORDER proc
+ * for these (just using the "=" operator won't do), since in general whether
+ * the tuple is < or > some non-array equality key might influence advancement
+ * of any of the scan's actual arrays. The top-level scan can only terminate
+ * after it has processed the key space covered by the product of each and
+ * every equality constraint, including both non-arrays and (required) arrays.
+ * (Also, _bt_tuple_before_array_skeys needs to know the difference so that it
+ * can correctly suppress _bt_check_compare setting continuescan=false.)
+ *
+ * Note also that we may sometimes need to advance the array keys when the
+ * existing array keys are already an exact match for every corresponding
+ * value from caller's tuple according to _bt_check_compare. This is how we
+ * deal with inequalities that are required in the current scan direction.
+ * They can advance the array keys here, even though they don't influence the
+ * initial positioning strategy within _bt_first.
+ */
+static bool
+_bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, int sktrig)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ ScanDirection dir = pstate->dir;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ ScanKey cur;
+ int ikey,
+ first_nonrequired_ikey PG_USED_FOR_ASSERTS_ONLY = -1,
+ arrayidx = 0,
+ ntupatts = BTreeTupleGetNAtts(tuple, rel);
+ bool arrays_advanced = false,
+ arrays_exhausted,
+ beyond_end_advance = false,
+ foundRequiredOppositeDirOnly = false,
+ all_eqtype_sk_equal = true,
+ all_required_eqtype_sk_equal PG_USED_FOR_ASSERTS_ONLY = true;
+
+ Assert(_bt_verify_keys_with_arraykeys(scan));
+
+ /*
+ * Try to advance array keys via a series of binary searches.
+ *
+ * Loop iterates through the current scankeys (so->keyData[], which were
+ * output by _bt_preprocess_keys earlier) and then sets input scan keys
+ * (so->arrayKeyData[] scan keys) to new array values.
+ */
+ for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++)
+ {
+ BTArrayKeyInfo *array = NULL;
+ ScanKey skeyarray = NULL;
+ FmgrInfo *orderproc;
+ int attnum = cur->sk_attno;
+ Datum tupdatum;
+ bool requiredSameDir = false,
+ requiredOppositeDirOnly = false,
+ tupnull;
+ int32 result;
+ int set_elem = 0;
+
+ /*
+ * Set up ORDER 3-way comparison function and array state
+ */
+ orderproc = &so->orderProcs[attnum - 1];
+ if (cur->sk_flags & SK_SEARCHARRAY &&
+ cur->sk_strategy == BTEqualStrategyNumber)
+ {
+ Assert(arrayidx < so->numArrayKeys);
+ array = &so->arrayKeys[arrayidx++];
+ skeyarray = &so->arrayKeyData[array->scan_key];
+ Assert(skeyarray->sk_attno == attnum);
+ }
+
+ if (((cur->sk_flags & SK_BT_REQFWD) && ScanDirectionIsForward(dir)) ||
+ ((cur->sk_flags & SK_BT_REQBKWD) && ScanDirectionIsBackward(dir)))
+ requiredSameDir = true;
+ else if (((cur->sk_flags & SK_BT_REQFWD) && ScanDirectionIsBackward(dir)) ||
+ ((cur->sk_flags & SK_BT_REQBKWD) && ScanDirectionIsForward(dir)))
+ requiredOppositeDirOnly = true;
+
+ /*
+ * Remember first non-required array scan key offset (for assertions)
+ */
+ if (!requiredSameDir && array && first_nonrequired_ikey == -1)
+ first_nonrequired_ikey = ikey;
+
+ /*
+ * Optimization: Skip over known-satisfied scan keys
+ */
+ if (ikey < sktrig)
+ continue;
+
+ /*
+ * When we come across an inequality scan key that's required in the
+ * opposite direction only, and is positioned after an unsatisfied
+ * scan key that's required in the current scan direction, remember it
+ */
+ if (requiredOppositeDirOnly)
+ {
+ Assert(ikey > sktrig);
+ Assert(cur->sk_strategy != BTEqualStrategyNumber);
+ Assert(!foundRequiredOppositeDirOnly);
+
+ foundRequiredOppositeDirOnly = true;
+
+ continue;
+ }
+
+ /*
+ * Other than that, we're not interested in scan keys that aren't
+ * required in the current scan direction (unless they're non-required
+ * array equality scan keys, which still need to be advanced by us)
+ */
+ if (!requiredSameDir && !array)
+ continue;
+
+ /*
+ * Whenever a required scan key triggers array key advancement within
+ * _bt_check_compare, the corresponding tuple attribute's value is
+ * typically < the scan key value (or > in the backwards scan case).
+ *
+ * If this is a required equality strategy scan key, this is just an
+ * optimization; we know that _bt_tuple_before_array_skeys has already
+ * determined that this scan key places us ahead of caller's tuple.
+ * There's no need to compare it a second time below.
+ *
+ * If this is a required inequality strategy scan key, we _must_ rely
+ * on _bt_check_compare like this; it knows all the intricacies around
+ * evaluating inequality strategy scan keys (e.g., row comparisons).
+ * There is no simple mapping onto the opclass ORDER proc we can use.
+ * But once we know that we have an unsatisfied inequality, we can
+ * treat it in the same way as an unsatisfied equality at this point.
+ *
+ * The arrays advance correctly in both cases because both involve the
+ * scan reaching the end of the key space for some array key (or some
+ * distinct set of array keys). The only difference is that in the
+ * equality strategy case the end is "between array keys", while in
+ * the inequality strategy case the end is "within an array key".
+ * Either way, we just advance higher order arrays by one increment.
+ *
+ * See below for a full explanation of "beyond end" advancement.
+ */
+ if (ikey == sktrig && !array)
+ {
+ Assert(requiredSameDir);
+ Assert(!arrays_advanced);
+
+ beyond_end_advance = true;
+
+ continue;
+ }
+
+ /*
+ * Nothing for us to do with a required inequality strategy scan key
+ * that wasn't the one that _bt_check_compare stopped on
+ */
+ if (cur->sk_strategy != BTEqualStrategyNumber)
+ continue;
+
+ /*
+ * Here we perform steps for all array scan keys after a required
+ * array scan key whose binary search triggered "beyond end of array
+ * element" array advancement due to encountering a tuple attribute
+ * value > the closest matching array key (or < for backwards scans).
+ *
+ * See below for a full explanation of "beyond end" advancement.
+ *
+ * NB: We must do this for all arrays -- not just required arrays.
+ * Otherwise the incremental array advancement step won't "carry".
+ */
+ if (beyond_end_advance)
+ {
+ int final_elem_dir;
+
+ if (ScanDirectionIsBackward(dir) || !array)
+ final_elem_dir = 0;
+ else
+ final_elem_dir = array->num_elems - 1;
+
+ if (array && array->cur_elem != final_elem_dir)
+ {
+ array->cur_elem = final_elem_dir;
+ skeyarray->sk_argument = array->elem_values[final_elem_dir];
+ arrays_advanced = true;
+ }
+
+ continue;
+ }
+
+ /*
+ * Here we perform steps for any required scan keys after the first
+ * required scan key whose tuple attribute was < the closest matching
+ * array key when we dealt with it (or > for backwards scans).
+ *
+ * This earlier required array key already puts us ahead of caller's
+ * tuple in the key space (for the current scan direction). We must
+ * make sure that subsequent lower-order array keys do not put us too
+ * far ahead (ahead of tuples that have yet to be seen by our caller).
+ * For example, when a tuple "(a, b) = (42, 5)" advances the array
+ * keys on "a" from 40 to 45, we must also set "b" to whatever the
+ * first array element for "b" is. It would be wrong to allow "b" to
+ * be set based on the tuple value.
+ *
+ * Perform the same steps with truncated high key attributes. You can
+ * think of this as a "binary search" for the element closest to the
+ * value -inf. Again, the arrays must never get ahead of the scan.
+ */
+ if (!all_eqtype_sk_equal || attnum > ntupatts)
+ {
+ int first_elem_dir;
+
+ if (ScanDirectionIsForward(dir) || !array)
+ first_elem_dir = 0;
+ else
+ first_elem_dir = array->num_elems - 1;
+
+ if (array && array->cur_elem != first_elem_dir)
+ {
+ array->cur_elem = first_elem_dir;
+ skeyarray->sk_argument = array->elem_values[first_elem_dir];
+ arrays_advanced = true;
+ }
+
+ /*
+ * Truncated -inf value will always be assumed to satisfy any
+ * required equality scan keys according to _bt_check_compare.
+ * Unset all_eqtype_sk_equal to avoid _bt_check_compare recheck.
+ *
+ * Deliberately don't unset all_required_eqtype_sk_equal here to
+ * avoid spurious postcondition assertion failures. We must
+ * follow _bt_tuple_before_array_skeys's example by not treating
+ * truncated attributes as having the exact value -inf.
+ */
+ all_eqtype_sk_equal = false;
+
+ continue;
+ }
+
+ /*
+ * Search in scankey's array for the corresponding tuple attribute
+ * value from caller's tuple
+ */
+ tupdatum = index_getattr(tuple, attnum, itupdesc, &tupnull);
+
+ if (array)
+ {
+ bool ratchets = (requiredSameDir && !arrays_advanced);
+
+ /*
+ * Binary search for closest match that's available from the array
+ */
+ set_elem = _bt_binsrch_array_skey(orderproc, ratchets, dir,
+ tupdatum, tupnull, array, cur,
+ &result);
+
+ /*
+ * Required arrays only ever ratchet forwards (backwards).
+ *
+ * This condition makes it safe for binary searches to skip over
+ * array elements that the scan must already be ahead of by now.
+ * That is strictly an optimization. Our assertion verifies that
+ * the condition holds, which doesn't depend on the optimization.
+ */
+ Assert(!ratchets ||
+ ((ScanDirectionIsForward(dir) && set_elem >= array->cur_elem) ||
+ (ScanDirectionIsBackward(dir) && set_elem <= array->cur_elem)));
+ Assert(set_elem >= 0 && set_elem < array->num_elems);
+ }
+ else
+ {
+ Assert(requiredSameDir);
+
+ /*
+ * This is a required non-array equality strategy scan key, which
+ * we'll treat as a degenerate single value array.
+ *
+ * _bt_advance_array_keys_increment won't have an array for this
+ * scan key, but it can't matter. If you think about how real
+ * single value arrays roll over, you'll understand why this is.
+ */
+ result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
+ cur->sk_argument, cur);
+ }
+
+ /*
+ * Consider "beyond end of array element" array advancement.
+ *
+ * When the tuple attribute value is > the closest matching array key
+ * (or < in the backwards scan case), we need to ratchet this array
+ * forward (backward) by one increment, so that caller's tuple ends up
+ * being < final array value instead (or > final array value instead).
+ * See also: state machine postcondition assertions, below.
+ *
+ * This process has to work for all of the arrays, not just this one:
+ * it must "carry" to higher-order arrays when the set_elem that we
+ * just used for this array happens to have been the final element
+ * (for current scan direction). We can't just increment (decrement)
+ * set_elem itself and expect correct behavior -- at least not when
+ * there's more than one array to consider.
+ *
+ * Our approach is to set each subsequent/lower-order array to its
+ * final element. We'll then advance all array keys incrementally,
+ * just outside the loop. That way all earlier/higher order arrays
+ * (arrays _before_ this one) will advance as needed by rolling over.
+ *
+ * The array keys advance a little like the way that a mileage gauge
+ * advances. Imagine a mechanical display that rolls over from 999 to
+ * 000 every time we drive our car another 1,000 miles. Each decimal
+ * digit behaves a little like an array from the array state machine
+ * implemented by this function. (_bt_advance_array_keys_increment
+ * won't actually allow the most significant array to roll over, but
+ * that's just defensive.)
+ *
+ * Suppose we have 3 array keys a, b, and c. Each "digit"/array has
+ * 10 distinct elements that happen to match across each array: values
+ * 0 through to 9. Caller's tuple "(a, b, c) = (3, 7.9, 2)" might
+ * initially have its "b" array advanced up to the value 7 (because 7
+ * was matched by its binary search), and its "c" array advanced to 9.
+ * The final incremental advancement step (outside the loop) will then
+ * finish things off by "advancing" the array on "c" to 0, which then
+ * carries over to "b" (since "c" rolled over when it advanced). Once
+ * we're done we'll have "rounded up from 7.9 to 8" for the "b" array,
+ * without needing to directly alter its set_elem.
+ *
+ * The "a" array won't have advanced on this occasion, since the "b"
+ * array didn't roll over in turn. But it would given a tuple like
+ * "(a, b, c) = (3, 9.9, 4)". A tuple like "(a, b, c) = (9, 9.9, 8)"
+ * will eventually try (though fail) to roll over the array on "a".
+ * Failing to roll over everything like this exhausts all the arrays.
+ *
+ * Under this scheme required array keys only ever ratchet forwards
+ * (or backwards), and always do so to the maximum possible extent
+ * that we can know will be safe without seeing the scan's next tuple.
+ */
+ if (requiredSameDir &&
+ ((ScanDirectionIsForward(dir) && result > 0) ||
+ (ScanDirectionIsBackward(dir) && result < 0)))
+ beyond_end_advance = true;
+
+ /*
+ * Also track whether all relevant attributes from caller's tuple will
+ * be equal to the scan's array keys once we're done with it
+ */
+ if (result != 0)
+ {
+ all_eqtype_sk_equal = false;
+ if (requiredSameDir)
+ all_required_eqtype_sk_equal = false;
+ }
+
+ /*
+ * Optimization: If this call was triggered by a non-required array,
+ * and we know that tuple won't satisfy the qual, we give up right
+ * away. This often avoids advancing the array keys, which will save
+ * wasted cycles from calling _bt_update_keys_with_arraykeys below
+ * (plus it avoids needlessly unsetting pstate.finaltupchecked).
+ */
+ if (!all_eqtype_sk_equal && !requiredSameDir && sktrig == ikey)
+ {
+ Assert(!arrays_advanced);
+ Assert(!foundRequiredOppositeDirOnly);
+
+ break;
+ }
+
+ /* Advance array keys, even if set_elem isn't an exact match */
+ if (array && array->cur_elem != set_elem)
+ {
+ array->cur_elem = set_elem;
+ skeyarray->sk_argument = array->elem_values[set_elem];
+ arrays_advanced = true;
+ }
+ }
+
+ /*
+ * Consider if we need to advance the array keys incrementally to finish
+ * off "beyond end of array element" array advancement. This is the only
+ * way that the array keys can be exhausted.
+ */
+ arrays_exhausted = false;
+ if (beyond_end_advance)
+ {
+ /* Non-required scan keys never exhaust arrays/end top-level scan */
+ Assert(sktrig < first_nonrequired_ikey ||
+ first_nonrequired_ikey == -1);
+
+ if (!_bt_advance_array_keys_increment(scan, dir))
+ arrays_exhausted = true;
+ else
+ arrays_advanced = true;
+
+ /*
+ * The newly advanced array keys won't be equal anymore, so remember
+ * that in order to avoid a second _bt_check_compare call for tuple
+ */
+ all_eqtype_sk_equal = all_required_eqtype_sk_equal = false;
+ }
+
+ if (arrays_advanced)
+ {
+ /*
+ * We advanced the array keys. Finalize everything by performing an
+ * in-place update of the scan's search-type scan keys.
+ *
+ * If we missed this final step then any call to _bt_check_compare
+ * would use stale array keys until such time as _bt_preprocess_keys
+ * was once again called by _bt_first.
+ */
+ _bt_update_keys_with_arraykeys(scan);
+
+ /*
+ * If any required array keys were advanced, be prepared to recheck
+ * the final tuple against the new array keys (as an optimization)
+ */
+ pstate->finaltupchecked = false;
+ }
+
+ /*
+ * State machine postcondition assertions.
+ *
+ * Tuple must now be <= current/newly advanced required array keys. Same
+ * goes for other required equality type scan keys, which are "degenerate
+ * single value arrays" for our purposes. (As usual the rule is the same
+ * for backwards scans once the operators are flipped around.)
+ *
+ * We're stricter than that in cases where the tuple was already equal to
+ * the previous array keys when we were called: tuple must now be < the
+ * new array keys (or > the array keys). This is a consequence of another
+ * rule: we must always advance the array keys by at least one increment
+ * (unless _bt_advance_array_keys_increment found that we'd exhausted all
+ * arrays, ending the top-level index scan).
+ *
+ * Our caller decides when to start primitive index scans based in part on
+ * the current array keys. It always needs to see a precise array-wise
+ * picture of the scan's progress. If we were to advance the array keys
+ * by less than the exact maximum safe amount, our caller might then make
+ * a subtly wrong decision about when to end the ongoing primitive scan.
+ * (These assertions won't reliably detect every case where the array keys
+ * haven't advanced by the expected/maximum amount, but they come close.)
+ */
+ Assert(_bt_verify_keys_with_arraykeys(scan));
+ Assert(arrays_exhausted ||
+ (_bt_tuple_before_array_skeys(scan, pstate, tuple, 0) ==
+ !all_required_eqtype_sk_equal));
+
+ /*
+ * If the array keys are now exhausted, end the top-level index scan
+ */
+ Assert(!so->needPrimScan);
+ if (arrays_exhausted)
+ {
+ /* Caller's tuple can't match new qual */
+ pstate->continuescan = false;
+ return false;
+ }
+
+ /*
+ * The array keys aren't exhausted, so provisionally assume that the
+ * current primitive index scan will continue
+ */
+ pstate->continuescan = true;
+
+ /*
+ * Does caller's tuple now match the new qual? Call _bt_check_compare a
+ * second time to find out (unless it's already clear that it can't).
+ */
+ if (all_eqtype_sk_equal)
+ {
+ bool continuescan;
+ int insktrig;
+
+ Assert(arrays_advanced);
+
+ if (likely(_bt_check_compare(dir, so, tuple, ntupatts, itupdesc,
+ &continuescan, &insktrig, false)))
+ return true;
+
+ /*
+ * Handle inequalities marked required in the current scan direction.
+ *
+ * It's just about possible that our _bt_check_compare call indicates
+ * that the scan should be terminated due to an unsatisfied inequality
+ * that wasn't initially recognized as such by us. Handle this by
+ * calling ourselves recursively while indicating that the trigger is
+ * now the inequality that we missed first time around.
+ *
+ * Note: we only need to do this in cases where the initial call to
+ * _bt_check_compare (that led to calling here) gave up upon finding
+ * an unsatisfied required equality/array scan key before it could
+ * reach the inequality. The second _bt_check_compare call took place
+ * after the array keys were advanced (to array keys that definitely
+ * match the tuple), so it can't have been overlooked a second time.
+ *
+ * Note: this is useful because we won't have to wait until the next
+ * tuple to advance the array keys a second time (to values that'll
+ * put the scan ahead of this tuple). Handling this ourselves isn't
+ * truly required. But it avoids complicating our contract. The only
+ * alternative is to allow an awkward exception to the general rule
+ * (the rule about always advancing the arrays to the maximum possible
+ * extent that caller's tuple can safely allow).
+ */
+ if (!continuescan)
+ {
+ Assert(insktrig > sktrig);
+ Assert(insktrig < first_nonrequired_ikey ||
+ first_nonrequired_ikey == -1);
+ return _bt_advance_array_keys(scan, pstate, tuple, insktrig);
+ }
+ }
+
+ /*
+ * Handle inequalities marked required in the opposite scan direction.
+ *
+ * If we advanced the array keys (which is now certain except in the case
+ * where we only needed to deal with non-required arrays), it's possible
+ * that the scan is now at the start of "matching" tuples (at least by the
+ * definition used by _bt_tuple_before_array_skeys), but is nevertheless
+ * still many leaf pages before the position that _bt_first is capable of
+ * repositioning the scan to.
+ *
+ * This can happen when we have an inequality scan key required in the
+ * opposite direction only, that's less significant than the scan key that
+ * triggered array advancement during our initial _bt_check_compare call.
+ * If even finaltup doesn't satisfy this less significant inequality scan
+ * key once we temporarily flip the scan direction, that indicates that
+ * even finaltup is before the _bt_first-wise initial position for these
+ * newly advanced array keys.
+ */
+ if (foundRequiredOppositeDirOnly && pstate->finaltup &&
+ !_bt_tuple_before_array_skeys(scan, pstate, pstate->finaltup, 0))
+ {
+ int nfinaltupatts = BTreeTupleGetNAtts(pstate->finaltup, rel);
+ ScanDirection flipped = -dir;
+ bool continuescan;
+ int opsktrig;
+
+ Assert(arrays_advanced);
+
+ _bt_check_compare(flipped, so, pstate->finaltup, nfinaltupatts,
+ itupdesc, &continuescan, &opsktrig, false);
+
+ if (!continuescan && opsktrig > sktrig)
+ {
+ /*
+ * Continuing the ongoing primitive index scan as-is risks
+ * uselessly scanning a huge number of leaf pages from before the
+ * page that we'll quickly jump to by descending the index anew.
+ *
+ * Play it safe: start a new primitive index scan. _bt_first is
+ * guaranteed to at least move the scan to the next leaf page.
+ */
+ Assert(opsktrig < first_nonrequired_ikey ||
+ first_nonrequired_ikey == -1);
+ pstate->continuescan = false;
+ so->needPrimScan = true;
+
+ return false;
+ }
+
+ /*
+ * Caller's tuple might still be before the _bt_first-wise start of
+ * matches for the new array keys, but at least finaltup is at or
+ * ahead of that position. That's good enough; continue as-is.
+ */
+ }
+
+ /*
+ * Caller's tuple is < the newly advanced array keys (or > when this is a
+ * backwards scan).
+ *
+ * It's possible that later tuples will also turn out to have values that
+ * are still < the now-current array keys (or > the current array keys).
+ * Our caller will handle this by performing what amounts to a linear
+ * search of the page, implemented by calling _bt_check_compare and then
+ * _bt_tuple_before_array_skeys for each tuple. Our caller should locate
+ * the first tuple >= the array keys before long (or locate the first
+ * tuple <= the array keys before long).
+ *
+ * This approach has various advantages over a binary search of the page.
+ * We expect that our caller will either quickly discover the next tuple
+ * covered by the current array keys, or quickly discover that it needs
+ * another primitive index scan (using its finaltup precheck) instead.
+ * Either way, a binary search is unlikely to beat a simple linear search.
+ *
+ * It's also not clear that a binary search will be any faster when we
+ * really do have to search through hundreds of tuples beyond this one.
+ * Several binary searches (one per array advancement) might be required
+ * while reading through a single page. Our linear search is structured
+ * as one continuous search that just advances the arrays in passing, and
+ * that only needs a little extra logic to deal with inequality scan keys.
+ */
+ return false;
+}
/*
* _bt_preprocess_keys() -- Preprocess scan keys
@@ -749,6 +1925,19 @@ _bt_restore_array_keys(IndexScanDesc scan)
* Again, missing cross-type operators might cause us to fail to prove the
* quals contradictory when they really are, but the scan will work correctly.
*
+ * Index scans with array keys need to be able to advance each array's keys
+ * and make them the current search-type scan keys without calling here. They
+ * expect to be able to call _bt_update_keys_with_arraykeys instead. We need
+ * to be careful about that case when we determine redundancy; equality quals
+ * must not be eliminated as redundant on the basis of array input keys that
+ * might change before another call here can take place.
+ *
+ * Note, however, that the presence of an array scan key doesn't affect how we
+ * determine if index quals are contradictory. Contradictory qual scans move
+ * on to the next primitive index scan right away, by incrementing the scan's
+ * array keys once control reaches _bt_array_keys_remain. There won't be a
+ * call to _bt_update_keys_with_arraykeys, so there's nothing for us to break.
+ *
* Row comparison keys are currently also treated without any smarts:
* we just transfer them into the preprocessed array without any
* editorialization. We can treat them the same as an ordinary inequality
@@ -895,8 +2084,11 @@ _bt_preprocess_keys(IndexScanDesc scan)
so->qual_ok = false;
return;
}
- /* else discard the redundant non-equality key */
- xform[j] = NULL;
+ else if (!(eq->sk_flags & SK_SEARCHARRAY))
+ {
+ /* else discard the redundant non-equality key */
+ xform[j] = NULL;
+ }
}
/* else, cannot determine redundancy, keep both keys */
}
@@ -986,6 +2178,22 @@ _bt_preprocess_keys(IndexScanDesc scan)
continue;
}
+ /*
+ * Is this an array scan key that _bt_preprocess_array_keys merged
+ * with some earlier array key during its initial preprocessing pass?
+ */
+ if (cur->sk_flags & SK_BT_RDDNARRAY)
+ {
+ /*
+ * key is redundant for this primitive index scan (and will be
+ * redundant during all subsequent primitive index scans)
+ */
+ Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(j == (BTEqualStrategyNumber - 1));
+ Assert(so->numArrayKeys > 0);
+ continue;
+ }
+
/* have we seen one of these before? */
if (xform[j] == NULL)
{
@@ -999,7 +2207,26 @@ _bt_preprocess_keys(IndexScanDesc scan)
&test_result))
{
if (test_result)
- xform[j] = cur;
+ {
+ if (j == (BTEqualStrategyNumber - 1) &&
+ ((xform[j]->sk_flags & SK_SEARCHARRAY) ||
+ (cur->sk_flags & SK_SEARCHARRAY)))
+ {
+ /*
+ * Must never replace an = array operator ourselves,
+ * nor can we ever fail to remember an = array
+ * operator. _bt_update_keys_with_arraykeys expects
+ * this.
+ */
+ ScanKey outkey = &outkeys[new_numberOfKeys++];
+
+ memcpy(outkey, cur, sizeof(ScanKeyData));
+ if (numberOfEqualCols == attno - 1)
+ _bt_mark_scankey_required(outkey);
+ }
+ else
+ xform[j] = cur;
+ }
else if (j == (BTEqualStrategyNumber - 1))
{
/* key == a && key == b, but a != b */
@@ -1027,6 +2254,98 @@ _bt_preprocess_keys(IndexScanDesc scan)
so->numberOfKeys = new_numberOfKeys;
}
+/*
+ * _bt_update_keys_with_arraykeys() -- Finalize advancing array keys
+ *
+ * This function just transfers newly advanced array keys that were set in
+ * "so->arrayKeyData[]" over to corresponding "so->keyData[]" scan keys. This
+ * avoids the full set of push-ups that take place in _bt_preprocess_keys at
+ * the start of each new primitive index scan. In particular, it avoids doing
+ * anything that would be considered unsafe while holding a buffer lock.
+ *
+ * Note that _bt_preprocess_keys is aware of our special requirements when
+ * considering if quals are redundant. For full details see comments above
+ * _bt_preprocess_array_keys (and above _bt_preprocess_keys itself).
+ */
+static void
+_bt_update_keys_with_arraykeys(IndexScanDesc scan)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey cur;
+ int ikey,
+ arrayidx = 0;
+
+ Assert(so->qual_ok);
+
+ for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++)
+ {
+ BTArrayKeyInfo *array;
+ ScanKey skeyarray;
+
+ Assert((cur->sk_flags & SK_BT_RDDNARRAY) == 0);
+
+ /* Just update equality array scan keys */
+ if (cur->sk_strategy != BTEqualStrategyNumber ||
+ !(cur->sk_flags & SK_SEARCHARRAY))
+ continue;
+
+ array = &so->arrayKeys[arrayidx++];
+ skeyarray = &so->arrayKeyData[array->scan_key];
+
+ /* Update the scan key's argument */
+ Assert(cur->sk_attno == skeyarray->sk_attno);
+ cur->sk_argument = skeyarray->sk_argument;
+ }
+
+ Assert(arrayidx == so->numArrayKeys);
+}
+
+/*
+ * Verify that the scan's "so->arrayKeyData[]" scan keys are in agreement with
+ * the current "so->keyData[]" search-type scan keys. Used within assertions.
+ */
+#ifdef USE_ASSERT_CHECKING
+static bool
+_bt_verify_keys_with_arraykeys(IndexScanDesc scan)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey cur;
+ int ikey,
+ arrayidx = 0;
+
+ if (!so->qual_ok)
+ return false;
+
+ for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++)
+ {
+ BTArrayKeyInfo *array;
+ ScanKey skeyarray;
+
+ if (cur->sk_strategy != BTEqualStrategyNumber ||
+ !(cur->sk_flags & SK_SEARCHARRAY))
+ continue;
+
+ array = &so->arrayKeys[arrayidx++];
+ skeyarray = &so->arrayKeyData[array->scan_key];
+
+ /* Verify so->arrayKeyData[] input key has expected sk_argument */
+ if (skeyarray->sk_argument != array->elem_values[array->cur_elem])
+ return false;
+
+ /* Verify so->arrayKeyData[] input key agrees with output key */
+ if (cur->sk_attno != skeyarray->sk_attno)
+ return false;
+ if (cur->sk_argument != skeyarray->sk_argument)
+ return false;
+ }
+
+ if (arrayidx != so->numArrayKeys)
+ return false;
+
+ return true;
+}
+#endif
+
/*
* Compare two scankey values using a specified operator.
*
@@ -1360,58 +2679,267 @@ _bt_mark_scankey_required(ScanKey skey)
*
* Return true if so, false if not. If the tuple fails to pass the qual,
* we also determine whether there's any need to continue the scan beyond
- * this tuple, and set *continuescan accordingly. See comments for
+ * this tuple, and set pstate.continuescan accordingly. See comments for
* _bt_preprocess_keys(), above, about how this is done.
*
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
+ * Forward scan callers can pass a high key tuple in the hopes of having us
+ * set pstate.continuescan to false, and avoiding an unnecessary visit to the
+ * page to the right.
+ *
+ * Forwards scan callers with equality type array scan keys are obligated to
+ * set up page state in a way that makes it possible for us to check the final
+ * tuple (the high key for a forward scan) early, before we've expended too
+ * much effort on comparing tuples that cannot possibly be matches for any set
+ * of array keys. This is just an optimization.
+ *
+ * Advances the current set of array keys for SK_SEARCHARRAY scans where
+ * appropriate. These callers are required to initialize the page level high
+ * key in pstate before the first call here for the page (when the scan
+ * direction is forwards). Note that we rely on _bt_readpage calling here in
+ * page offset number order (for its scan direction). Any other order will
+ * lead to inconsistent array key state.
*
* scan: index scan descriptor (containing a search-type scankey)
+ * pstate: Page level input and output parameters
* tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
+ * finaltup: Is tuple the final one we'll be called with for this page?
* requiredMatchedByPrecheck: indicates that scan keys required for
* direction scan are already matched
*/
bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan,
+_bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate,
+ IndexTuple tuple, bool finaltup,
bool requiredMatchedByPrecheck)
{
- TupleDesc tupdesc;
- BTScanOpaque so;
- int keysz;
+ TupleDesc tupdesc = RelationGetDescr(scan->indexRelation);
+ int natts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool res;
+ int sktrig;
+
+ Assert(pstate->continuescan);
+ Assert(!so->needPrimScan);
+
+ res = _bt_check_compare(pstate->dir, so, tuple, natts, tupdesc,
+ &pstate->continuescan, &sktrig,
+ requiredMatchedByPrecheck);
+
+ /*
+ * Only one _bt_check_compare call is required in the common case where
+ * there are no equality-type array scan keys. Otherwise we can only
+ * accept _bt_check_compare's answer unreservedly when it didn't set
+ * continuescan=false.
+ */
+ if (!so->numArrayKeys || pstate->continuescan)
+ return res;
+
+ /*
+ * _bt_check_compare call set continuescan=false in the presence of
+ * equality type array keys.
+ *
+ * While we might really need to end the top-level index scan, most of the
+ * time this just means that the scan needs to reconsider its array keys.
+ */
+ if (_bt_tuple_before_array_skeys(scan, pstate, tuple, sktrig))
+ {
+ /*
+ * Current tuple is < the current array scan keys/equality constraints
+ * (or > in the backward scan case). Don't need to advance the array
+ * keys. Must decide whether to start a new primitive scan instead.
+ *
+ * If this tuple isn't the finaltup for the page, then recheck the
+ * finaltup stashed in pstate as an optimization. That allows us to
+ * quit scanning this page early when it's clearly hopeless (we don't
+ * need to wait for the finaltup call to give up on a primitive scan).
+ */
+ if (finaltup || (!pstate->finaltupchecked && pstate->finaltup &&
+ _bt_tuple_before_array_skeys(scan, pstate,
+ pstate->finaltup, 0)))
+ {
+ /*
+ * Give up on the ongoing primitive index scan.
+ *
+ * Even the final tuple (the high key for forward scans, or the
+ * tuple from page offset number 1 for backward scans) is before
+ * the current array keys. That strongly suggests that continuing
+ * this primitive scan would be less efficient than starting anew.
+ *
+ * See also: finaltup remarks after the _bt_advance_array_keys
+ * call below, which fully explain our policy around how and when
+ * primitive index scans end.
+ */
+ pstate->continuescan = false;
+
+ /*
+ * Set up a new primitive index scan that will reposition the
+ * top-level scan to the first leaf page whose key space is
+ * covered by our array keys. The top-level scan will "skip" a
+ * part of the index that can only contain non-matching tuples.
+ *
+ * Note: the next primitive index scan is guaranteed to land on
+ * some later leaf page (ideally it won't be this page's sibling).
+ * It follows that the top-level scan can never access the same
+ * leaf page more than once (unless the scan changes direction or
+ * btrestrpos is called). btcostestimate relies on this.
+ */
+ so->needPrimScan = true;
+ }
+ else
+ {
+ /*
+ * Stick with the ongoing primitive index scan, for now (override
+ * _bt_check_compare's suggestion that we end the scan).
+ *
+ * Note: we will end up here again and again given a group of
+ * tuples > the previous array keys and < the now-current keys
+ * (though only after an initial finaltup precheck determined that
+ * this page definitely covers key space from both array keysets).
+ * In effect, we perform a linear search of the page's remaining
+ * unscanned tuples every time the arrays advance past the key
+ * space of the scan's then-current tuple.
+ */
+ pstate->continuescan = true;
+
+ /*
+ * Our finaltup precheck determined that it is >= the current keys
+ * (although the current tuple is still < the current array keys).
+ *
+ * Remember that fact in pstate now. This avoids wasting cycles
+ * on repeating the same precheck step (checking the same finaltup
+ * against the same array keys) during later calls here for later
+ * tuples from this same leaf page.
+ */
+ pstate->finaltupchecked = true;
+ }
+
+ /* In any case, this indextuple doesn't match the qual */
+ return false;
+ }
+
+ /*
+ * Caller's tuple is >= the current set of array keys and other equality
+ * constraint scan keys (or <= if this is a backwards scans). It's now
+ * clear that we _must_ advance any required array keys in lockstep with
+ * the scan (or at least notice that the required array keys have been
+ * exhausted, which will end the top-level scan).
+ *
+ * Note: we might even advance the arrays when all existing keys are
+ * already equal to the values from the tuple at this point. See comments
+ * about inequality-driven array advancement above _bt_advance_array_keys.
+ */
+ if (_bt_advance_array_keys(scan, pstate, tuple, sktrig))
+ {
+ /* Tuple (which didn't match the old qual) now matches the new qual */
+ Assert(pstate->continuescan);
+ return true;
+ }
+
+ /*
+ * At this point we've either advanced the array keys beyond the tuple, or
+ * exhausted all array keys (which will end the top-level index scan).
+ * Either way, this index tuple doesn't match the new qual.
+ *
+ * The array keys usually advance using a tuple from before finaltup
+ * (there can only be one finaltup per page, of course). In the common
+ * case where we just advanced the array keys during a !finaltup call, we
+ * can be sure that there'll be at least one more opportunity to check the
+ * new array keys against another tuple from this same page. Things are
+ * more complicated for finaltup calls that advance the array keys at a
+ * page boundary. They'll often advance the arrays to values > finaltup,
+ * leaving us with no reliable information about the physical proximity of
+ * the first leaf page where matches for the new keys are to be found.
+ *
+ * Our policy is to allow our caller to move on to the next sibling page
+ * in these cases. This is speculative, in a way: it's always possible
+ * that the array keys will have advanced well beyond the key space
+ * covered by the next sibling page. And if it turns out like that then
+ * our caller will incur a wasted leaf page access.
+ *
+ * In practice this policy wins significantly more often than it loses.
+ * The fact that the final tuple advanced the array keys is an encouraging
+ * signal -- especially during forwards scans, where our high key/pivot
+ * finaltup has values derived from the right sibling's firstright tuple.
+ * This issue is quite likely to come up whenever multiple array keys are
+ * used by forward scans. There is a decent chance that every finaltup
+ * from every page will have at least one truncated -inf attribute, which
+ * makes it impossible for finaltup array advancement to advance the lower
+ * order arrays to exactly matching array elements. Workloads like that
+ * would see poor performance from a policy that conditions going to the
+ * next sibling page on having an exactly-matching finaltup on this page.
+ *
+ * Cases where continuing the scan onto the next sibling page is a bad
+ * idea usually quit scanning the page before even reaching finaltup; just
+ * making it as far as finaltup is a useful cue in its own right. This is
+ * partly due to a promise that _bt_advance_array_keys makes: it always
+ * advances the scan's array keys to the maximum possible extent that is
+ * sure to be safe, given what is known about the scan when it is called
+ * (namely the scan's current tuple and its array keys, though _not_ the
+ * next tuple whose key space is covered by any of the scan's arrays).
+ * That factor limits array advancement using finaltup to cases where no
+ * earlier tuple could bump the array keys to key space beyond finaltup,
+ * despite being given every opportunity to do so by us (with some help
+ * from _bt_advance_array_keys).
+ *
+ * Chances are good that finaltup won't be all that different to earlier
+ * nearby tuples: it is unlikely to make the tuple-wise position that
+ * matching tuples start at jump forward by a great many tuples, either.
+ * In particular, it is unlikely to jump by more tuples than caller will
+ * find on the next leaf page. That's why it makes sense to allow the
+ * ongoing primitive index scan to at least continue to the next page.
+ */
+
+ /* In any case, tuple doesn't match the new qual, either */
+ return false;
+}
+
+/*
+ * Test whether an indextuple satisfies current scan condition.
+ *
+ * Return true if so, false if not. If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction to
+ * pass the qual with the current set of array keys.
+ *
+ * This is a subroutine for _bt_checkkeys. It is written with the assumption
+ * that reaching the end of each distinct set of array keys terminates the
+ * ongoing primitive index scan. It is up to our caller (which has more high
+ * level context than us) to override that initial determination when it makes
+ * more sense to advance the array keys and continue with further tuples from
+ * the same leaf page.
+ */
+static bool
+_bt_check_compare(ScanDirection dir, BTScanOpaque so,
+ IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+ bool *continuescan, int *sktrig,
+ bool requiredMatchedByPrecheck)
+{
int ikey;
ScanKey key;
- Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+ Assert(!so->numArrayKeys || !requiredMatchedByPrecheck);
*continuescan = true; /* default assumption */
+ *sktrig = 0; /* default assumption */
- tupdesc = RelationGetDescr(scan->indexRelation);
- so = (BTScanOpaque) scan->opaque;
- keysz = so->numberOfKeys;
-
- for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+ for (key = so->keyData, ikey = 0; ikey < so->numberOfKeys; key++, ikey++)
{
Datum datum;
bool isNull;
Datum test;
bool requiredSameDir = false,
- requiredOppositeDir = false;
+ requiredOppositeDirOnly = false;
/*
* Check if the key is required for ordered scan in the same or
- * opposite direction. Save as flag variables for future usage.
+ * opposite direction. Also set an offset to this scan key for caller
+ * in case it stops the scan (used by scans that have array keys).
*/
if (((key->sk_flags & SK_BT_REQFWD) && ScanDirectionIsForward(dir)) ||
((key->sk_flags & SK_BT_REQBKWD) && ScanDirectionIsBackward(dir)))
requiredSameDir = true;
else if (((key->sk_flags & SK_BT_REQFWD) && ScanDirectionIsBackward(dir)) ||
((key->sk_flags & SK_BT_REQBKWD) && ScanDirectionIsForward(dir)))
- requiredOppositeDir = true;
+ requiredOppositeDirOnly = true;
+ *sktrig = ikey;
/*
* Is the key required for scanning for either forward or backward
@@ -1419,7 +2947,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* known to be matched, skip the check. Except for the row keys,
* where NULLs could be found in the middle of matching values.
*/
- if ((requiredSameDir || requiredOppositeDir) &&
+ if ((requiredSameDir || requiredOppositeDirOnly) &&
!(key->sk_flags & SK_ROW_HEADER) && requiredMatchedByPrecheck)
continue;
@@ -1522,11 +3050,28 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
/*
* Apply the key checking function. When the key is required for
- * opposite direction scan, it must be already satisfied by
- * _bt_first() except for the NULLs checking, which have already done
- * above.
+ * opposite-direction scans it must be an inequality satisfied by
+ * _bt_first(), barring NULLs, which we just checked a moment ago.
+ *
+ * (Also can't apply this optimization with scans that use arrays,
+ * since _bt_advance_array_keys() sometimes allows the scan to see a
+ * few tuples from before the would-be _bt_first() starting position
+ * for the scan's just-advanced array keys.)
+ *
+ * Even required equality quals (that can't use this optimization due
+ * to being required in both scan directions) rely on the assumption
+ * that _bt_first() will always use the quals for initial positioning
+ * purposes. We stop the scan as soon as any required equality qual
+ * fails, so it had better only happen at the end of equal tuples in
+ * the current scan direction (never at the start of equal tuples).
+ * See comments in _bt_first().
+ *
+ * (The required equality quals issue also has specific implications
+ * for scans that use arrays. They sometimes perform a linear search
+ * of remaining unscanned tuples, forcing the primitive index scan to
+ * continue until it locates tuples >= the scan's new array keys.)
*/
- if (!requiredOppositeDir)
+ if (!requiredOppositeDirOnly || so->numArrayKeys)
{
test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
datum, key->sk_argument);
@@ -1544,15 +3089,25 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* Tuple fails this qual. If it's a required qual for the current
* scan direction, then we can conclude no further tuples will
* pass, either.
- *
- * Note: because we stop the scan as soon as any required equality
- * qual fails, it is critical that equality quals be used for the
- * initial positioning in _bt_first() when they are available. See
- * comments in _bt_first().
*/
if (requiredSameDir)
*continuescan = false;
+ /*
+ * Always set continuescan=false for equality-type array keys that
+ * don't pass -- even for an array scan key not marked required.
+ *
+ * A non-required scan key (array or otherwise) can never actually
+ * terminate the scan. It's just convenient for callers to treat
+ * continuescan=false as a signal that it might be time to advance
+ * the array keys, independent of whether they're required or not.
+ * (Even setting continuescan=false with a required scan key won't
+ * usually end a scan that uses arrays.)
+ */
+ if ((key->sk_flags & SK_SEARCHARRAY) &&
+ key->sk_strategy == BTEqualStrategyNumber)
+ *continuescan = false;
+
/*
* In any case, this indextuple doesn't match the qual.
*/
@@ -1571,7 +3126,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* it's not possible for any future tuples in the current scan direction
* to pass the qual.
*
- * This is a subroutine for _bt_checkkeys, which see for more info.
+ * This is a subroutine for _bt_checkkeys/_bt_check_compare.
*/
static bool
_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 03a5fbdc6..e37597c26 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -106,8 +106,7 @@ static List *build_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexOptInfo *index, IndexClauseSet *clauses,
bool useful_predicate,
ScanTypeControl scantype,
- bool *skip_nonnative_saop,
- bool *skip_lower_saop);
+ bool *skip_nonnative_saop);
static List *build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel,
List *clauses, List *other_clauses);
static List *generate_bitmap_or_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -706,8 +705,6 @@ eclass_already_used(EquivalenceClass *parent_ec, Relids oldrelids,
* index AM supports them natively, we should just include them in simple
* index paths. If not, we should exclude them while building simple index
* paths, and then make a separate attempt to include them in bitmap paths.
- * Furthermore, we should consider excluding lower-order ScalarArrayOpExpr
- * quals so as to create ordered paths.
*/
static void
get_index_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -716,37 +713,17 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
{
List *indexpaths;
bool skip_nonnative_saop = false;
- bool skip_lower_saop = false;
ListCell *lc;
/*
* Build simple index paths using the clauses. Allow ScalarArrayOpExpr
- * clauses only if the index AM supports them natively, and skip any such
- * clauses for index columns after the first (so that we produce ordered
- * paths if possible).
+ * clauses only if the index AM supports them natively.
*/
indexpaths = build_index_paths(root, rel,
index, clauses,
index->predOK,
ST_ANYSCAN,
- &skip_nonnative_saop,
- &skip_lower_saop);
-
- /*
- * If we skipped any lower-order ScalarArrayOpExprs on an index with an AM
- * that supports them, then try again including those clauses. This will
- * produce paths with more selectivity but no ordering.
- */
- if (skip_lower_saop)
- {
- indexpaths = list_concat(indexpaths,
- build_index_paths(root, rel,
- index, clauses,
- index->predOK,
- ST_ANYSCAN,
- &skip_nonnative_saop,
- NULL));
- }
+ &skip_nonnative_saop);
/*
* Submit all the ones that can form plain IndexScan plans to add_path. (A
@@ -784,7 +761,6 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
index, clauses,
false,
ST_BITMAPSCAN,
- NULL,
NULL);
*bitindexpaths = list_concat(*bitindexpaths, indexpaths);
}
@@ -817,27 +793,19 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
* to true if we found any such clauses (caller must initialize the variable
* to false). If it's NULL, we do not ignore ScalarArrayOpExpr clauses.
*
- * If skip_lower_saop is non-NULL, we ignore ScalarArrayOpExpr clauses for
- * non-first index columns, and we set *skip_lower_saop to true if we found
- * any such clauses (caller must initialize the variable to false). If it's
- * NULL, we do not ignore non-first ScalarArrayOpExpr clauses, but they will
- * result in considering the scan's output to be unordered.
- *
* 'rel' is the index's heap relation
* 'index' is the index for which we want to generate paths
* 'clauses' is the collection of indexable clauses (IndexClause nodes)
* 'useful_predicate' indicates whether the index has a useful predicate
* 'scantype' indicates whether we need plain or bitmap scan support
* 'skip_nonnative_saop' indicates whether to accept SAOP if index AM doesn't
- * 'skip_lower_saop' indicates whether to accept non-first-column SAOP
*/
static List *
build_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexOptInfo *index, IndexClauseSet *clauses,
bool useful_predicate,
ScanTypeControl scantype,
- bool *skip_nonnative_saop,
- bool *skip_lower_saop)
+ bool *skip_nonnative_saop)
{
List *result = NIL;
IndexPath *ipath;
@@ -848,7 +816,6 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
List *orderbyclausecols;
List *index_pathkeys;
List *useful_pathkeys;
- bool found_lower_saop_clause;
bool pathkeys_possibly_useful;
bool index_is_ordered;
bool index_only_scan;
@@ -880,19 +847,11 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
* on by btree and possibly other places.) The list can be empty, if the
* index AM allows that.
*
- * found_lower_saop_clause is set true if we accept a ScalarArrayOpExpr
- * index clause for a non-first index column. This prevents us from
- * assuming that the scan result is ordered. (Actually, the result is
- * still ordered if there are equality constraints for all earlier
- * columns, but it seems too expensive and non-modular for this code to be
- * aware of that refinement.)
- *
* We also build a Relids set showing which outer rels are required by the
* selected clauses. Any lateral_relids are included in that, but not
* otherwise accounted for.
*/
index_clauses = NIL;
- found_lower_saop_clause = false;
outer_relids = bms_copy(rel->lateral_relids);
for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++)
{
@@ -903,30 +862,20 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
IndexClause *iclause = (IndexClause *) lfirst(lc);
RestrictInfo *rinfo = iclause->rinfo;
- /* We might need to omit ScalarArrayOpExpr clauses */
- if (IsA(rinfo->clause, ScalarArrayOpExpr))
+ /*
+ * We might need to omit ScalarArrayOpExpr clauses when index AM
+ * lacks native support
+ */
+ if (!index->amsearcharray && IsA(rinfo->clause, ScalarArrayOpExpr))
{
- if (!index->amsearcharray)
+ if (skip_nonnative_saop)
{
- if (skip_nonnative_saop)
- {
- /* Ignore because not supported by index */
- *skip_nonnative_saop = true;
- continue;
- }
- /* Caller had better intend this only for bitmap scan */
- Assert(scantype == ST_BITMAPSCAN);
- }
- if (indexcol > 0)
- {
- if (skip_lower_saop)
- {
- /* Caller doesn't want to lose index ordering */
- *skip_lower_saop = true;
- continue;
- }
- found_lower_saop_clause = true;
+ /* Ignore because not supported by index */
+ *skip_nonnative_saop = true;
+ continue;
}
+ /* Caller had better intend this only for bitmap scan */
+ Assert(scantype == ST_BITMAPSCAN);
}
/* OK to include this clause */
@@ -956,11 +905,9 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
/*
* 2. Compute pathkeys describing index's ordering, if any, then see how
* many of them are actually useful for this query. This is not relevant
- * if we are only trying to build bitmap indexscans, nor if we have to
- * assume the scan is unordered.
+ * if we are only trying to build bitmap indexscans.
*/
pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN &&
- !found_lower_saop_clause &&
has_useful_pathkeys(root, rel));
index_is_ordered = (index->sortopfamily != NULL);
if (index_is_ordered && pathkeys_possibly_useful)
@@ -1212,7 +1159,6 @@ build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel,
index, &clauseset,
useful_predicate,
ST_BITMAPSCAN,
- NULL,
NULL);
result = list_concat(result, indexpaths);
}
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 35c9e3c86..2b622b7a5 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6512,8 +6512,6 @@ genericcostestimate(PlannerInfo *root,
double numIndexTuples;
double spc_random_page_cost;
double num_sa_scans;
- double num_outer_scans;
- double num_scans;
double qual_op_cost;
double qual_arg_cost;
List *selectivityQuals;
@@ -6528,7 +6526,7 @@ genericcostestimate(PlannerInfo *root,
/*
* Check for ScalarArrayOpExpr index quals, and estimate the number of
- * index scans that will be performed.
+ * primitive index scans that will be performed for caller
*/
num_sa_scans = 1;
foreach(l, indexQuals)
@@ -6558,19 +6556,8 @@ genericcostestimate(PlannerInfo *root,
*/
numIndexTuples = costs->numIndexTuples;
if (numIndexTuples <= 0.0)
- {
numIndexTuples = indexSelectivity * index->rel->tuples;
- /*
- * The above calculation counts all the tuples visited across all
- * scans induced by ScalarArrayOpExpr nodes. We want to consider the
- * average per-indexscan number, so adjust. This is a handy place to
- * round to integer, too. (If caller supplied tuple estimate, it's
- * responsible for handling these considerations.)
- */
- numIndexTuples = rint(numIndexTuples / num_sa_scans);
- }
-
/*
* We can bound the number of tuples by the index size in any case. Also,
* always estimate at least one tuple is touched, even when
@@ -6608,27 +6595,31 @@ genericcostestimate(PlannerInfo *root,
*
* The above calculations are all per-index-scan. However, if we are in a
* nestloop inner scan, we can expect the scan to be repeated (with
- * different search keys) for each row of the outer relation. Likewise,
- * ScalarArrayOpExpr quals result in multiple index scans. This creates
- * the potential for cache effects to reduce the number of disk page
- * fetches needed. We want to estimate the average per-scan I/O cost in
- * the presence of caching.
+ * different search keys) for each row of the outer relation. This
+ * creates the potential for cache effects to reduce the number of disk
+ * page fetches needed. We want to estimate the average per-scan I/O cost
+ * in the presence of caching.
*
* We use the Mackert-Lohman formula (see costsize.c for details) to
* estimate the total number of page fetches that occur. While this
* wasn't what it was designed for, it seems a reasonable model anyway.
* Note that we are counting pages not tuples anymore, so we take N = T =
* index size, as if there were one "tuple" per page.
+ *
+ * Note: we assume that there will be no repeat index page fetches across
+ * ScalarArrayOpExpr primitive scans from the same logical index scan.
+ * This is guaranteed to be true for btree indexes, but is very optimistic
+ * with index AMs that cannot natively execute ScalarArrayOpExpr quals.
+ * However, these same index AMs also accept our default pessimistic
+ * approach to counting num_sa_scans (btree caller caps this), so we don't
+ * expect the final indexTotalCost to be wildly over-optimistic.
*/
- num_outer_scans = loop_count;
- num_scans = num_sa_scans * num_outer_scans;
-
- if (num_scans > 1)
+ if (loop_count > 1)
{
double pages_fetched;
/* total page fetches ignoring cache effects */
- pages_fetched = numIndexPages * num_scans;
+ pages_fetched = numIndexPages * loop_count;
/* use Mackert and Lohman formula to adjust for cache effects */
pages_fetched = index_pages_fetched(pages_fetched,
@@ -6638,11 +6629,9 @@ genericcostestimate(PlannerInfo *root,
/*
* Now compute the total disk access cost, and then report a pro-rated
- * share for each outer scan. (Don't pro-rate for ScalarArrayOpExpr,
- * since that's internal to the indexscan.)
+ * share for each outer scan
*/
- indexTotalCost = (pages_fetched * spc_random_page_cost)
- / num_outer_scans;
+ indexTotalCost = (pages_fetched * spc_random_page_cost) / loop_count;
}
else
{
@@ -6658,10 +6647,8 @@ genericcostestimate(PlannerInfo *root,
* evaluated once at the start of the scan to reduce them to runtime keys
* to pass to the index AM (see nodeIndexscan.c). We model the per-tuple
* CPU costs as cpu_index_tuple_cost plus one cpu_operator_cost per
- * indexqual operator. Because we have numIndexTuples as a per-scan
- * number, we have to multiply by num_sa_scans to get the correct result
- * for ScalarArrayOpExpr cases. Similarly add in costs for any index
- * ORDER BY expressions.
+ * indexqual operator. Similarly add in costs for any index ORDER BY
+ * expressions.
*
* Note: this neglects the possible costs of rechecking lossy operators.
* Detecting that that might be needed seems more expensive than it's
@@ -6674,7 +6661,7 @@ genericcostestimate(PlannerInfo *root,
indexStartupCost = qual_arg_cost;
indexTotalCost += qual_arg_cost;
- indexTotalCost += numIndexTuples * num_sa_scans * (cpu_index_tuple_cost + qual_op_cost);
+ indexTotalCost += numIndexTuples * (cpu_index_tuple_cost + qual_op_cost);
/*
* Generic assumption about index correlation: there isn't any.
@@ -6752,7 +6739,6 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
bool eqQualHere;
bool found_saop;
bool found_is_null_op;
- double num_sa_scans;
ListCell *lc;
/*
@@ -6767,17 +6753,12 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
*
* For a RowCompareExpr, we consider only the first column, just as
* rowcomparesel() does.
- *
- * If there's a ScalarArrayOpExpr in the quals, we'll actually perform N
- * index scans not one, but the ScalarArrayOpExpr's operator can be
- * considered to act the same as it normally does.
*/
indexBoundQuals = NIL;
indexcol = 0;
eqQualHere = false;
found_saop = false;
found_is_null_op = false;
- num_sa_scans = 1;
foreach(lc, path->indexclauses)
{
IndexClause *iclause = lfirst_node(IndexClause, lc);
@@ -6817,14 +6798,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
else if (IsA(clause, ScalarArrayOpExpr))
{
ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause;
- Node *other_operand = (Node *) lsecond(saop->args);
- int alength = estimate_array_length(other_operand);
clause_op = saop->opno;
found_saop = true;
- /* count number of SA scans induced by indexBoundQuals only */
- if (alength > 1)
- num_sa_scans *= alength;
}
else if (IsA(clause, NullTest))
{
@@ -6884,13 +6860,6 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
JOIN_INNER,
NULL);
numIndexTuples = btreeSelectivity * index->rel->tuples;
-
- /*
- * As in genericcostestimate(), we have to adjust for any
- * ScalarArrayOpExpr quals included in indexBoundQuals, and then round
- * to integer.
- */
- numIndexTuples = rint(numIndexTuples / num_sa_scans);
}
/*
@@ -6900,6 +6869,48 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
genericcostestimate(root, path, loop_count, &costs);
+ /*
+ * Now compensate for btree's ability to efficiently execute scans with
+ * SAOP clauses.
+ *
+ * btree automatically combines individual ScalarArrayOpExpr primitive
+ * index scans whenever the tuples covered by the next set of array keys
+ * are close to tuples covered by the current set. This makes the final
+ * number of descents particularly difficult to estimate. However, btree
+ * scans never visit any single leaf page more than once. That puts a
+ * natural floor under the worst case number of descents.
+ *
+ * It's particularly important that we not wildly overestimate the number
+ * of descents needed for a clause list with several SAOPs -- the costs
+ * really aren't multiplicative in the way genericcostestimate expects. In
+ * general, most distinct combinations of SAOP keys will tend to not find
+ * any matching tuples. Furthermore, btree scans search for the next set
+ * of array keys using the next tuple in line, and so won't even need a
+ * direct comparison to eliminate most non-matching sets of array keys.
+ *
+ * Clamp the number of descents to the estimated number of leaf page
+ * visits. This is still fairly pessimistic, but tends to result in more
+ * accurate costing of scans with several SAOP clauses -- especially when
+ * each array has more than a few elements. The cost of adding additional
+ * array constants to a low-order SAOP column should saturate past a
+ * certain point (except where selectivity estimates continue to shift).
+ *
+ * Also clamp the number of descents to 1/3 the number of index pages.
+ * This avoids implausibly high estimates with low selectivity paths,
+ * where scans frequently require no more than one or two descents.
+ *
+ * XXX Ideally, we'd also account for the fact that non-boundary SAOP
+ * clause quals (which the B-Tree code uses "non-required" scan keys for)
+ * won't actually contribute to the total number of descents of the index.
+ * This would require pushing down more context into genericcostestimate.
+ */
+ if (costs.num_sa_scans > 1)
+ {
+ costs.num_sa_scans = Min(costs.num_sa_scans, costs.numIndexPages);
+ costs.num_sa_scans = Min(costs.num_sa_scans, index->pages / 3);
+ costs.num_sa_scans = Max(costs.num_sa_scans, 1);
+ }
+
/*
* Add a CPU-cost component to represent the costs of initial btree
* descent. We don't charge any I/O cost for touching upper btree levels,
@@ -6907,9 +6918,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* comparisons to descend a btree of N leaf tuples. We charge one
* cpu_operator_cost per comparison.
*
- * If there are ScalarArrayOpExprs, charge this once per SA scan. The
- * ones after the first one are not startup cost so far as the overall
- * plan is concerned, so add them only to "total" cost.
+ * If there are ScalarArrayOpExprs, charge this once per estimated
+ * primitive SA scan. The ones after the first one are not startup cost
+ * so far as the overall plan goes, so just add them to "total" cost.
*/
if (index->tuples > 1) /* avoid computing log(0) */
{
@@ -6926,7 +6937,8 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* in cases where only a single leaf page is expected to be visited. This
* cost is somewhat arbitrarily set at 50x cpu_operator_cost per page
* touched. The number of such pages is btree tree height plus one (ie,
- * we charge for the leaf page too). As above, charge once per SA scan.
+ * we charge for the leaf page too). As above, charge once per estimated
+ * primitive SA scan.
*/
descentCost = (index->tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER * cpu_operator_cost;
costs.indexStartupCost += descentCost;
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 42509042a..1515bbd40 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4035,6 +4035,19 @@ description | Waiting for a newly initialized WAL file to reach durable storage
</para>
</note>
+ <note>
+ <para>
+ Every time an index is searched, the index's
+ <structname>pg_stat_all_indexes</structname>.<structfield>idx_scan</structfield>
+ field is incremented. This usually happens once per index scan node
+ execution, but might take place several times during execution of a scan
+ that searches for multiple values together. Only queries that use certain
+ <acronym>SQL</acronym> constructs to search for rows matching any value
+ out of a list (or an array) of multiple scalar values are affected. See
+ <xref linkend="functions-comparisons"/> for details.
+ </para>
+ </note>
+
</sect2>
<sect2 id="monitoring-pg-statio-all-tables-view">
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index acfd9d1f4..84c068ae3 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1910,7 +1910,7 @@ SELECT count(*) FROM dupindexcols
(1 row)
--
--- Check ordering of =ANY indexqual results (bug in 9.2.0)
+-- Check that index scans with =ANY indexquals return rows in index order
--
explain (costs off)
SELECT unique1 FROM tenk1
@@ -1936,12 +1936,11 @@ explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
- QUERY PLAN
--------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------------------
Index Only Scan using tenk1_thous_tenthous on tenk1
- Index Cond: (thousand < 2)
- Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
-(3 rows)
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -1952,18 +1951,35 @@ ORDER BY thousand;
1 | 1001
(2 rows)
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Only Scan Backward using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ thousand | tenthous
+----------+----------
+ 1 | 1001
+ 0 | 3000
+(2 rows)
+
SET enable_indexonlyscan = OFF;
explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
- QUERY PLAN
---------------------------------------------------------------------------------------
- Sort
- Sort Key: thousand
- -> Index Scan using tenk1_thous_tenthous on tenk1
- Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
-(4 rows)
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -1974,6 +1990,25 @@ ORDER BY thousand;
1 | 1001
(2 rows)
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan Backward using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ thousand | tenthous
+----------+----------
+ 1 | 1001
+ 0 | 3000
+(2 rows)
+
RESET enable_indexonlyscan;
--
-- Check elimination of constant-NULL subexpressions
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 2c7327014..86e541780 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -8680,10 +8680,9 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 >= any (array[1,5]);
Merge Cond: (j1.id1 = j2.id1)
Join Filter: (j2.id2 = j1.id2)
-> Index Scan using j1_id1_idx on j1
- -> Index Only Scan using j2_pkey on j2
+ -> Index Scan using j2_id1_idx on j2
Index Cond: (id1 >= ANY ('{1,5}'::integer[]))
- Filter: ((id1 % 1000) = 1)
-(7 rows)
+(6 rows)
select * from j1
inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index d49ce9f30..41b955a27 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -753,7 +753,7 @@ SELECT count(*) FROM dupindexcols
WHERE f1 BETWEEN 'WA' AND 'ZZZ' and id < 1000 and f1 ~<~ 'YX';
--
--- Check ordering of =ANY indexqual results (bug in 9.2.0)
+-- Check that index scans with =ANY indexquals return rows in index order
--
explain (costs off)
@@ -774,6 +774,15 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
SET enable_indexonlyscan = OFF;
explain (costs off)
@@ -785,6 +794,15 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
RESET enable_indexonlyscan;
--
--
2.42.0
On 21/11/2023 04:52, Peter Geoghegan wrote:
Attached is v7.
First, some high-level reactions before looking at the patch very closely:
- +1 on the general idea. Hard to see any downsides if implemented right.
- This changes the meaning of amsearcharray==true to mean that the
ordering is preserved with ScalarArrayOps, right? You change B-tree to
make that true, but what about any out-of-tree index AM extensions? I
don't know if any such extensions exist, and I don't think we should
jump through any hoops to preserve backwards compatibility here, but
probably deserves a notice in the release notes if nothing else.
- You use the term "primitive index scan" a lot, but it's not clear to
me what it means. Does one ScalarArrayOps turn into one "primitive index
scan"? Or each element in the array turns into a separate primitive
index scan? Or something in between? Maybe add a new section to the
README explain how that works.
- _bt_preprocess_array_keys() is called for each btrescan(). It performs
a lot of work like cmp function lookups and desconstructing and merging
the arrays, even if none of the SAOP keys change in the rescan. That
could make queries with nested loop joins like this slower than before:
"select * from generate_series(1, 50) g, tenk1 WHERE g = tenk1.unique1
and tenk1.two IN (1,2);".
- nbtutils.c is pretty large now. Perhaps create a new file
nbtpreprocesskeys.c or something?
- You and Matthias talked about an implicit state machine. I wonder if
this could be refactored to have a more explicit state machine. The
state transitions and interactions between _bt_checkkeys(),
_bt_advance_array_keys() and friends feel complicated.
And then some details:
--- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -4035,6 +4035,19 @@ description | Waiting for a newly initialized WAL file to reach durable storage </para> </note>+ <note> + <para> + Every time an index is searched, the index's + <structname>pg_stat_all_indexes</structname>.<structfield>idx_scan</structfield> + field is incremented. This usually happens once per index scan node + execution, but might take place several times during execution of a scan + that searches for multiple values together. Only queries that use certain + <acronym>SQL</acronym> constructs to search for rows matching any value + out of a list (or an array) of multiple scalar values are affected. See + <xref linkend="functions-comparisons"/> for details. + </para> + </note> +
Is this true even without this patch? Maybe commit this separately.
The "Only queries ..." sentence feels difficult. Maybe something like
"For example, queries using IN (...) or = ANY(...) constructs.".
* _bt_preprocess_keys treats each primitive scan as an independent piece of
* work. That structure pushes the responsibility for preprocessing that must
* work "across array keys" onto us. This division of labor makes sense once
* you consider that we're typically called no more than once per btrescan,
* whereas _bt_preprocess_keys is always called once per primitive index scan.
"That structure ..." is a garden-path sentence. I kept parsing "that
must work" as one unit, the same way as "that structure", and it didn't
make sense. Took me many re-reads to parse it correctly. Now that I get
it, it doesn't bother me anymore, but maybe it could be rephrased.
Is there _any_ situation where _bt_preprocess_array_keys() is called
more than once per btrescan?
/*
* Look up the appropriate comparison operator in the opfamily.
*
* Note: it's possible that this would fail, if the opfamily is
* incomplete, but it seems quite unlikely that an opfamily would omit
* non-cross-type comparison operators for any datatype that it supports
* at all. ...
*/
I agree that's unlikely. I cannot come up with an example where you
would have cross-type operators between A and B, but no same-type
operators between B and B. For any real-world opfamily, that would be an
omission you'd probably want to fix.
Still I wonder if we could easily fall back if it doesn't exist? And
maybe add a check in the 'opr_sanity' test for that.
In _bt_readpage():
/*
* Prechecking the page with scan keys required for direction scan. We
* check these keys with the last item on the page (according to our scan
* direction). If these keys are matched, we can skip checking them with
* every item on the page. Scan keys for our scan direction would
* necessarily match the previous items. Scan keys required for opposite
* direction scan are already matched by the _bt_first() call.
*
* With the forward scan, we do this check for the last item on the page
* instead of the high key. It's relatively likely that the most
* significant column in the high key will be different from the
* corresponding value from the last item on the page. So checking with
* the last item on the page would give a more precise answer.
*
* We skip this for the first page in the scan to evade the possible
* slowdown of point queries. Never apply the optimization with a scans
* that uses array keys, either, since that breaks certain assumptions.
* (Our search-type scan keys change whenever _bt_checkkeys advances the
* arrays, invalidating any precheck. Tracking all that would be tricky.)
*/
if (!so->firstPage && !numArrayKeys && minoff < maxoff)
{
It's sad to disable this optimization completely for array keys. It's
actually a regression from current master, isn't it? There's no
fundamental reason we couldn't do it for array keys so I think we should
do it.
_bt_checkkeys() is called in an assertion in _bt_readpage, but it has
the side-effect of advancing the array keys. Side-effects from an
assertion seems problematic.
Vague idea: refactor _bt_checkkeys() into something that doesn't have
side-effects, and have a separate function or an argument to
_bt_checkkeys() to advance to next array key. The prechecking
optimization and the Assertion could both use the side-effect-free function.
--
Heikki Linnakangas
Neon (https://neon.tech)
On 11/21/23 03:52, Peter Geoghegan wrote:
On Sat, Nov 11, 2023 at 1:08 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:Thanks. Here's my review of the btree-related code:
Attached is v7.
I haven't looked at the code, but I decided to do a bit of blackbox perf
and stress testing, to get some feeling of what to expect in terms of
performance improvements, and see if there happen to be some unexpected
regressions. Attached is a couple simple bash scripts doing a
brute-force test with tables of different size / data distribution,
number of values in the SAOP expression, etc.
And a PDF visualizing the comparing the results between master and build
with the patch applied. First group of columns is master, then patched,
and then (patched/master) comparison, with green=faster, red=slower. The
columns are for different number of values in the SAOP condition.
Overall, the results look pretty good, with consistent speedups of up to
~30% for large number of values (SAOP with 1000 elements). There's a
couple blips where the performance regresses, also by up to ~30%. It's
too regular to be a random variation (it affects different combinations
of parameters, tablesizes), but it seems to only affect one of the two
machines I used for testing. Interestingly enough, it's the newer one.
I'm not convinced this is a problem we have to solve. It's possible it
only affects cases that are implausible in practice (the script forces a
particular scan type, and maybe it would not be picked in practice). But
maybe it's fixable ...
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
saop-benchmark.pdfapplication/pdf; name=saop-benchmark.pdfDownload
%PDF-1.4
% ����
3
0
obj
<<
/Type
/Catalog
/Names
<<
>>
/PageLabels
<<
/Nums
[
0
<<
/S
/D
/St
1
>>
]
>>
/Outlines
2
0
R
/Pages
1
0
R
>>
endobj
4
0
obj
<<
/Creator
(�� G o o g l e S h e e t s)
/Title
(�� s a o p b e n c h m a r k)
>>
endobj
5
0
obj
<<
/Type
/Page
/Parent
1
0
R
/MediaBox
[
0
0
842
595
]
/Contents
6
0
R
/Resources
7
0
R
/Annots
9
0
R
/Group
<<
/S
/Transparency
/CS
/DeviceRGB
>>
>>
endobj
6
0
obj
<<
/Filter
/FlateDecode
/Length
8
0
R
>>
stream
x��]]�����/��������@���W�$~�.f���,�x�$3yJ�]s�H ?��YT�t���a����)��UE)�_/�a�����]���^�z��}�>��j��,���5�|��b���(�\���N�n{#�����2��~������ �v�w�t��7���/B.v3��.����G)��"��'�ok{�y<RS�Qh��J�����#3{�[BZ��[�kM�������-M��_wlJ�+�o\wt��U04�o�?��O���{8���{��1�����:� �
�|�
��o\wt�S�%]cc��p��8 �^����}�4��t�h/*�������"��5p�t|�vt����\wt�.�W06���p��8���8�U�uG��������]�� �
.�7�\wtZ�*�q���L:��E�K��+�}&�W'��,��W���p��8���<�����.@\���Y$�\wt��5Tg��p��y����+������.@ \��# ��;� g�*(�HPp��;xh\��,��;�
p���H������l�_���`�7_M����+8��Zp���@W����0�]���������p)�qcc��$��d���kx��A$H�h���P�E�uG����+������.� �~
�Y$�\wt.��k8�E�uG��_�����]�� �
���?����(���io��Tz�����������F��h����Bh���V.��jp��� �� ���.�K������Z c=�~
LQ�V��b�0>,��j\m,Z�zTk�!���p?�6�����f� `�.�(��+X�,v����I����eq)�fl� ����dy�q4o�o Q����������J�^��C#�e
���B������9Z�����%mMR���.my���N��UXpw\��,?z�,Uy+X�u�"hm��
�.�h��������V�L��1Y��������7� ��{20k��YV���Xn*e��������\*)�p��X� �+Xg�����Y�l�j`��h"4� �xk�L&���J�mXB�9,�3�$�4M� X�
�+X�o���`� X9��3���`���E���8��e10��E3��b&q�M,�4`��+i��`��@7�aC��I ����@�IA4+���&18��IH�7�f
6�l2��$�r�� ��'1�a��{f���`-��!��$������$@�����WP���l�j`�� ��VO}���b,0�h�MXB�' /���4�p�
��q'�N�`Q�H
�f�Bh\Ah�����EAx�,�=$BV�2����i~�a,���\T��H��&�q�[B�A)d�&�Rsh\Ah��~g�QD
4��X�bV����`�c�=��=P!��+��w����F�U}=4�`�%Z��'X
�5q�����W��=� y~���\'�T�?
�����������x�����f�����,?��=Ls�]"06~\�_F���d�\v��<����H��=�g�wk(�#�Qf{�6/18���Y�XH���*��q�������
�5��@��[eUi).-����-���#���,��$�V�A��`l���
��19Y�K���5���<�7���*8�x\�)|z���I��I'�kX���"i��&KQ9���-�����c��/{��������&���J6h50�cb�
��wU��
\s��I���a@�$O�i�-k���yceCi�==�-H�5�p��,o*���� %�-FB�c6�H�f���5����n��@��0�Z@���DM�����f�y������ ��l��VC�Z��� ���b^����3�1dS�23�5pt��R6�g2A6�+@��!i.{}a�����^\� �����7�p�^��Y-���R .������`�PV\a��q�d �-T ��p![n����0�]��Z HA���
��B���=@^J&T ����FF���E�F�q���)]�31
��m4i���`�W� �fv
��5T �.k/���k&�
���i��AhpJ�,}�P��h.���&�����n����x�nX�T�W06�]�{�������QK���bk{'.�� ���N��������Z�+`R2�a���+X�;�%q����������Z����u�d0�\h� 4x6I���=W�`�rZV�u��J0��0�:�af
%(���e�����^����A���P� �4K�B�
Jz�)*�f�r�����OA����K7�oYs��-F��^:�K8�[���S�����AE-�T���4���5��o1������X?-[�����`�����������q/����l���s��������~�6�������Rm�����_�{��hR�"�A�jr�&}�A�jv!>}�A�j�/��<�_�q�_yPn�J_y���2�o�s��
h�(Zy�������c�NV�`M!���DSH�Q�"O�z�}p����?���������R?�<�?\&�����������b�1�|����1�������������������|�����3�7!p�_�q����1kK2�v����fx7|��U0���O
��kg��S���=��5����Y�v��X�W�m� '~�D�z��I��g*���d��~������u2|��~7|jb���:����������j�0m�����~1�o��������N���lp�������w��&b�b���������8^������;��.Q���-��3�1����fg���@nFqD�2��t|��X�O#���L��p�����f��'�6�������fS�l=�ug��1������M]���{V�e���Sd��5��m9
%�������wn����OF@� �u)�54�|��~����-�w[j�������(�so"�y�<�o��s�� j��P�_/�(�2=Xi���5�)x_;��m�f~Pnb2���/����"��/��9�U�P�������w���)���1�+�$������n��.g-B�����6F��v(�}s6.]+�� ���izM k`���~��
� ��5|���3��T�- 9'�WP�5O���������e~=���!��vC���F^�T������+3���2*N�#���}rjo9rR���f�#=�>����f�U��5&��M�oO�Zq���}����>�>c'W�^m���
w$���W���W�C�<��sq��"f����y6,�D��:F��(c��0�����X����w��5��p7:`�o(�k��Xq�L��t!4�}����8��>�n-�F�Q�v����q~�1.������5b������[�[�Ss�2R.O�#8����3�$t�i�z$�Z�n�R&.��Y���C9��)���:��0�;�np%D���/Ad���Q��m���>��&���$4J����|��#+�GP�N>r������#���������93�$4��^.OB���B���i���o��?�r�Z���Y���qV1q�>���Yv�i�������l.�:���3Lu��r��ko=r����U���(�"N�;n���Px��y�iy��+N��Q�����pA��E����r.H��8�.H�q�~����S�/
f$�����gr������J�r�*J���Q�.Ye�Y��7���K�z�]����K�����\� ��t�3�g{����=%���(�w@vN�,G���Y�7��X�N�R���q��S����8J�*�y�,��>g�g�3���K^�� ��g�Chf��N����!�s�+i�R���Q�fn}�"j�s����/m0=�`�zJ�����������J�����ZI��n��������G��P9[�������2���;���S��13� �����E����n�2�h_BL
���u��]��L�<�I�NA�2}e���L7�"f�.Cy���S9���!(bf�Ay��g�� F���!��>�F8��(a�G�,����������K�x"������T�������t$��.��&:E&Z����q���n_|���3�B�������o�O^�2�����w����k&/$��o�����v${�� \/$���MU��7� �[ P���y}zUr�?J&H��e��XQb�G�:Y��8�O������})\�������� �����7;���/���@Vb���J�s����<���
/���Y�M$/�i�\F/�a'�#A���~�f�����������=� �R{j�
$J�g2�R���������t�N�gn�Ab�;����v��`�5s�e�>9�G�@K�r���=�y�8����b}p��������rH��f=��e��i4��3?k.��H� �{�B$J��Ewu��Ex��I���u��)�;�<� ��+/Yf��;��p�W��(3���-���r����o�F�k457b��\�0�s�:��R��P3�FE���c}=�YF�;��2��cQ��/����0����T��]�EY��i���m\pq>B��5�Z�C�k��S�������$q�\Z(}�r�n�i�/`C����4���������m/(�U����o�9��M���Tej�����$/7a�w�C�"���+�)QZ�cQ_��XD���^^'�n��^��23�9�������O��M��m�?a����7sp��rWk�����z��0���o�9d�=o�������El��w�(���rR��/���
�m_(u B������3n]���
�e�-l����0�c+��;��x���p�{�Cd)�����?1;�P*��y{�}��>������gD�4���Q#E#��uD�����
�\!�Y9�l�@ �T�Pj��>(k�����_�%�k,��;k�)F�[��$��s��:���0����>�$^�`vz�G"Q���E��m����3~�Iv��Z�G����79-s1e���>���v8/oe��'xR� �&P��C��m��c�>g�h_-
?��h�mG��q1���� g�0�G��?��q^f5O�5����0��'8���9s��x}�(�����'���U
e�$��+�]�9��nZ{�|_��?���(C
endstream
endobj
8
0
obj
5234
endobj
9
0
obj
[
]
endobj
12
0
obj
<<
/Type
/Page
/Parent
1
0
R
/MediaBox
[
0
0
842
595
]
/Contents
13
0
R
/Resources
14
0
R
/Annots
16
0
R
/Group
<<
/S
/Transparency
/CS
/DeviceRGB
>>
>>
endobj
13
0
obj
<<
/Filter
/FlateDecode
/Length
15
0
R
>>
stream
x��]]�$9��GToW��3��? ����~E|\�������Z@ $�>���qU�#jG���e�S��������������C���?����s�6��G������������#����[W�[[M������YB[�mf�0�������0-�`n���� �a���V#j���7��o_`���?���k�qco�����j�7��N15���O���{)����6�RH1�� �U�v%�=(�>�������%6%�fT�)��$B]+�
G�0v������D��a��>}C7V}X~����zw�iX���7����~m��3V� ��d��|k�@I>�ij��]~
C5N��cS/��F�mF��y-;��j�X"|��fuE� ��_@8�&���E�?.���K��������:�������T��w�q�I(���"�MxS�L��\���2�Q�0S���rw/�y�����3�hFY�$C�S�eE�mY%������"�
��+���)��n��&�/[��T9��1���kI�!�t�I�0�P��������B�k�w�
�"�21Wd@I����1h�����;f��������l��I�?R%L*�>���I���H�����6,&�y��9�{��l��q_�n���I���X�v}5wj���T���t��|SZ%`�U�N��a���q�U��(�d�{�D�}���������[�<����_����
���0���S����T�J�������1h� q�Z�B�C�O-O�bG��cIt�,��O�]�@6i����}��4����w�����(F-�c���)r1�r=���s�
� k��������_'��U�����B�S|cd�cZ����%�D�S������B�������(G���e�0V��6�&��YvT�%Ye�CK2�>�B�B�~b:��t}�C�w:�#������Z�I���Q�:}��K�u��)��BL��c��������F}y){u��]�\m�5�����4r�
Z%`�U��^L��N Ui�CL*�>�B53�����;��ljm.SL���_��Qt����;�"�D�Zrt����;f��}�U9�Lnj�97,�j�y���� �(T��(0�������#�](�n��<`!{C�kZm?7,B;h�y�������&�<Y������w,�D]�����C�S�u�-�!;��(8���M��h3i�y���������� �I���"�GC���^P��)>
^�\6�6��m+m�RL�/,>��L��H ���?'m�Xw����OqziW��������d��W��\�yP���bY���X�e���q�&GR;.���)�jF�]m�m����j;m��Q���}���W��q���k���J�n��+����������k���:�����5�W���j�9h�u����\��N��(�c��)����qa�U�����*cp�G�n�N��}����Lt;��k���0���fB�o����;����D�������4�r�qZ$`|��!)mhY�p��Xk^]}�^�{�����7|��)ny#<�v���(�S�Y��<"��F��������+�wq��������w���x��������U#V���v3���S,To*�]�M_�p�6�igy�M~w�:�$����:m�XH��;m�6,:0h�q�]�>#���i�����p��E`Zi���E� j]�������}�z7���4�{F�O���Z�dQ��A{�.h�s���]��wA{�
��k�}RL�/x!���i/)�R��B����^����I��u9D���6E}�M��6���.m�X�p����=�\�q�,GS���������+�}��������tp����7���hO�:�>�Z�j��o�g��
/�7�j��{�K�����e�Y�[���t�$W�1��q��eWhZv:U�X�����}\��8Q�h���Y?� {?� {O+��>������Cz�Ew�+�
�Jt�r�k9��6�X�{Q���L�X����7���m������z��2�>�:�W6����}��_Z��������NA�D�a��f��}��(�<f����Z��cY�G��L�8������P���n%U�F����q�k�����(4�~�uwj��Z�k����'Qv���#�d�5s������z.�����]�Z|cA�,=�����
qOcc��e�p_�>\#��[Z���)n��G=�7����{m�X�NK��&��'y ��J;A�%����V��w-�rWjw8��
���������p�"����;�^9�Z��C��8Zz�U O��R�3%1��C��%U�v�����X_��1���� 7�;V�g����&e�P�F��Yv�Q&@�e�����c�Gm�Yu�c�-o�����@��6�n��=�J�E9��C���,GT��(0V�c�����u*���P���F;����K8�-�F������Po��,�R�(0V��zU�%.��iQ3dpm�Y�p���+�87���tB���v1�7>�N���X�rD�A�"c%:�Q�;Q��},�i{���m��;n�� �J;�
���}��wV����I�5�J3�8���v�\Fv���E �� ���^;uK��.y�1��4S�O�r��If�G� �V��$�,GT�d����8����$�J[�
� ���Oq�\�m��Y�c�
k +�IO�b��5��`j����>��"o|�aj�U�p�3S���7�F�p���H{����N{�V��3��A�U��W�*c�z�OI��P�T���)z7=�0[;���v�����w�����tzE��Ew<h7z�Z4� 5i�T��I��{��F������
�,w���������������F`���CD�S<d�/���a��;�����Jq,�����E ��X��k�������W����x
�
����j�1�~n~��c�!��}��=8�sa}?��Q�nX<Z�V��%����Z��`�g'm�Yv�Q&@�eG~�PxV5�f=��I��Pk���4���y�v��m��>�Z���\Y�c��x� +]��$�r�����J����������jc�a�Q��n�U��4�(�T���k~�JtZ���A.`%�98h���.�^�����.�������G�1
��#��'l�a���|i�!�Q;���C���:5���y������������o~������o�XS7U->���B�n�?��V|�!���:�����a���7�^��w����^���%:��I~�
���L2
��k?�2���_����?�����>���O�
A�L�w�\���6o���_��~�����N~t���������__��9v��������o���*d�y�T�95���S���p���sq�/?���E�K�2���?��e�����>���a�O<�� k@.������kj������e��������7�)�^��=�3�_�Y�=���7���)���{������i&�hNA��q��e�2}�!S�721�Ky��)��_O��)���|����w�����hW1�y\E�.�*"�W1�� ��I����x����x���}�sQ/�W���%�rQ��������:��V�^�*�vW�O
b����Y��o8d
"�%M���LA��<s?ZSQ�f�qd�Q,��E���4l��zV�1�c�/�T�vI���N�}w������ff��q������ 1zYo��)�8g��6��C��"
_��pR���dCu����&�
r%v��eYOsa.���i�K����.���a&/,y� !j���0��}�����P�U��F��R�D��<Ib%�[+$~(���A����L�D�X��X���,*Q6��J��
�m�!�����b��u(Cl�Cb�2$.�C�L������0V��z���������SNT6D�l�?���YV6�(���s�g��
1.w�<��|e��-��,����!j����_�W��u�n�,D�R�r����X��������
�X�|c!kQ��^�ra��"�����e���`�Z�Nk�~u�s��/�O��b���#s�G�7���R���i��e���]l$"~)?�!��o,d��t��P.�}��Wv����| ?�
L\��)��{�N�h��%��K�!����f�}������=�r����/������Is�45<�iK3����zd3$�0���#��c!�9��i6��<�f��eC��R��q>g��ci����'������%���aC$f_.���
��>�l������������}��%�0q�7g4=>g������c5�u�q|�3oH�Z�y�Z��BV�`�j��X-2�a��tx-Km_����x���v�R��"~�x2#���e��
�LA�x���Q��d!�a�(�`�;� ��R��z��5d����U�jV��$�L���$�72�}�!�U�h�ae�a5G�0�"j������
c�Aj�a{Lk!��|2��s,D�9C�_6V����9���VM#j�;�n��Od�m����D���P�� dt����X BZ���Yb�C�����X�������q�U�7����"]�Hlt&f��N���vyR{����W)3��6�JQ�s���0=��(G\v(E�#D���@,d~����r�gY���@�(G�H��v�D�0�������`��|���6<Hg���l\�i'x��:����9�u�s���s.D�3������(r�q�E�7����"m�sa���\������Q[^�%�:W1��%�KQW9�q��R{�������@�����X� ��>9�C��q��9��e#]Z2�e"�lx����6������8��N<�c�O���m�k}����L����+�rZ���h��5 ��r��o�AZ�$.ODY�������0��C&�x��������X=��R����3gbJ��n�L!u���H@D��x�� ���|Ny;�C&`.�w�����L2��,��&��i�$K]z:����/���GVD�2�"��7���F��Hb~;�s8db��C&6��GCb����]FVL�0��S�Mw;
�]�<"v�A�����}j������$"q�?y$ ��x[>$~X����j�����)���g�,�5�t����>jCZ�Q���2?�<��Q4��!���xX��%�!��ld�������NM)����)�.$W�!"���,�?�_��:p�p�4�����iL�|<J���nH�1vg�l���s�;�N�i����~��ylE�.c+".��2��6!/��ER��C&�o<dR��a}�R�����<�zSS
�GE"j�����|'Z�!b~�yP�92
����#m8*�VDm9I�I����\�/�R���f��6T�r���Q!�S��?8���������~����
��#��=�=��c�����cE�c�g� ��a�
e��y�9��P��K�0��w��r:����
����������$~qxG#$~.���k�P���"��x�t��Z����g �A���Q���,��s�E�2�n�p���_�Hu���u(Cl?G
�_~�$~.^:��YC{��UfX�3,H�R�r����#.�EeC�.O�:�BV6gfQ�r������!f����Pv���k_cmC����
��6D��z�c���U�J��]*Q��\�r�V��
DT"� �J�eY�r��J���\�0���!��e��������W"�F6�����y���F��{t��������
[��b���,B9f��d6=������/\+����UH�l��cC���UH��}9b�"�<}��72�r��e�lx�+����cU��2,�{��|���c}�W�fX�c)����,�u�?z�>w����,�yB��Gi�fH�1�������kl�v�R��^��:����������{9f�<2���<6C�1���;���sq�O}�=>�f������#5�u�9bs�@����X-|c!��o0d��E�:���/�iBb��n���/��G��.��#�z^H���q+�X���1�s1��#5"6-{��C��R��/�g��u������u,����%��^Hl� �|v����,���B��� A�W���B���QCb������o��.��#��c!���<�92�r�f���F��U�{���u#b�/�������y������~Q����_������(gz�2:��Lp��!��Q�,�������?y�q����Q�u4�l���������!���^,���nu ��$./�T��� ��.u(Gl�C9b�
$6�f
Q�|c!��3��C9f��d6����@Hlzyc�D���!b�Zj�D�����Lk���&3��9�u�m9bsm��Ws����A���7�������D�������������������. ����P9`�uR{��,��e��u[�Rqr��C�Qu(�l�C���Ab�:�!������O�A�O��S���,3=t�V� �}�����u!�<t�Y�r���&�ly��+"v�����Ab�#���AQH��{9b��
$v9���{9f�d6l ��Hl{i g��8��������=^2�a5����%�s�f���]�p�����}:��
OQ�|��.���=�?$�,_)������&��
endstream
endobj
15
0
obj
6637
endobj
16
0
obj
[
]
endobj
17
0
obj
<<
/Type
/Page
/Parent
1
0
R
/MediaBox
[
0
0
842
595
]
/Contents
18
0
R
/Resources
19
0
R
/Annots
21
0
R
/Group
<<
/S
/Transparency
/CS
/DeviceRGB
>>
>>
endobj
18
0
obj
<<
/Filter
/FlateDecode
/Length
20
0
R
>>
stream
x��]]���q��`��������<x"Q_��08�^�>\�)�cw�����"�J���,�/�;g�:G*��XE��o�[�p��g��)L����������O�G������y�������� �;��w���-�2=�}�����c����Q�g�|T����a�0g�a������~�����$�?����M!�����"�:"��5N�������[-�%v�%���x$1�s>d3�$(O��h����K����~��O����Q����_7T�Q�x�'A�1��p��!�v'�O� n��4I� }��7��)�������#%��KJ����$�C�����")5��_��y?RARj���$��~����:A�1�� y?R{)��>���(H� �G�4^��@�!�GJ��4^��j ���)w<w2`hL�/H�C��������8��HIx�%��������#%��KJ���w�a,��`���2�W�����_x���>k��|��"H����d����_����) 2`pw5��C����'0x�@�!�GJ�^����e �^d�8`_ ���#���C�����_p�x�����$������y?R���x*{�8��HI8,�Rc_v=q���r���Rc!F�Cr�J�g9?����% }��G��|��(�;Q���������0����U�[�G�M���_3|8�n��Z�D?�����&�m�Di(����;WXr~L|q��������@��,@8� ;�Lb5M�n��4w~��OC��J*�*>�Y�p Q?�w�w�O������8��MB3>Y�0i�x��������Sp��)��qY+�A��J�d����u����q�m�%�dN9��"a%�L�e�g�����yC�����j2����T�x���2��k%w�^i&n�3^ZXcV��
g[vnS=K������2g�dR9��*c����e5r4�t��#����Y6�T��$C�5��,���m�-[l�V�z������|U���n���o&&�dR���L�X��-2�^x�2n���U���Z?��x����'V/m�:��F�5f��z]�K�<7������ML���r��U�B��k��X�::��H24^c!��R|tU�p����gP�����������>^��u?�
�sb��g2��M��K����>^c)����O >W�����n��qf� �R���Y���S�:LIf���1�y�����AN\� '��KU�;�t���~��4 /�{JtO���I�9��j���!~OI��i�S�� V��� �C����TG/��-UX�2�������[{���{��"���{��+M�/�k�
^���th��B7]Y�C�?�{�e�6��{�c�2c!���o7���^uR{�IM"4^ci�+��t�zycS{V��x�YoU� r�=L`�f�Rs�y
c���M�N11�{��{:&���J�X�������~{��MU��^���
��X^k��}��#bO��dQ����QZTc����������01�&����,0�i����llnb���U�>��x���f����&���� m���i�����"
����)�)���AZx�: K����^�]�j�"���5V�z�$�u�X�1��(K��D��b���8�&��r�9Y����1������������o���X���<9Z�I�&��l��1[�Y���S���W����Z?J4^c!�/�b8t���Y1�YV��V�n7-�&S�z9����p�4u9z�S�gN-�%��Z<Sa�����d���R��Y0[��&�"%�dLY)g���P]�����d����^������i��Y�:�P8��B��-����e�E�=e��E�R�G������4�Z�L:�x������gN�gx�d9����c��k��������!��I;Y?f��qWS?jAwP�I��k,/�5e�#����JY>�,�~-�� J��q?���:��qP��I��k,U�\��N��1��
n4^��T�t��d-7L�����x�������T�����+�<�f���o2�,�����G�5��W&�bNZ8m�1���x��,5�s|��^�s����dh��R�� {Y�%]����k�65�oj�����K"�%��k<Yg\-���eU�d�F�}��Z�6�{Y�D�.6��8�{1���50���"���^I��0�\+��H*r��O:������`��������T:��V�o�� �lc�S����A<1�����z9��z9��/�}���i�9vR[D�^L����qk�e�e������?�a<���y����qN i��>^c��F^dN�z��>^c�z-�����D�.���{����=�Y4x)�c��K����=�'�dS��N��[u��Fid'��H�_i��Nd�[(�/ew<�ds�R�v3;�\K�z�0��X��X��$�<N��T��x�G���
�,���{|�x���q�,$��d/S�������2�G��l����g��o���"�y�Y5opv��o�X��t1�d�>�N�s4Q;<;��t1�d�>`'���3&?��1��2��T���� =b&d?`��U$���*�,��P;�i�7U�� C~�9�-�4^�A��[��I3���q7i�x����^H:wy#f�Jr��TSnd�����S�0���$�(o������v���Th�x���������x�������,Eo��A�(IWo���K���(��K/l�6���5������l^^.��W��+���B�ef�8�����a���v�w�A�����,S�r���=t�^���a[
�����{��p�������mb��7��]��O#L����F�>��?�4����c����s7�O#�
�O#L������7����8���"�0K�L#���_������|�?��~�����/��]|-b o���������?>������N�;}��_��������8���
�N�xz�#�5��*��/Z�?W����b��_��+�_���0_b�X�<L�m���?�������K����\K|�e�/��J�qY���u�9~u�U�S7�"�_�~s������U�1b`��T�H���A�(�&~�&����-r�(3[�f>����h�������_W�6�
L� V`����������N�S7�������c���|��-����+����5>�25&>g����_���y&��[PF���U%=k*���L
X�8s��.u0_�-&��
�=�@\y���������~8}��1`~Y'�f/�_�s3��Z:k�F07���y�
x?�O��55����� ���)
���&�V����� o��
�k5�0������~<������+}o���]���GS����gh����O=���~rXk���Z��� ��*p�� ?��eS�xY ����S}�t�������s����;��K=o����y��i���S��O=���3r�)�ZCO����@���n�CO[C����8�������
-F�X!�syT�
L�W���~�K=o
?����P�'Tk4n�=����bm�
��xV��ve��+��YSC�x��������D���� ���ID�L!�������o�f���y25�{M� s��S�5����C���T>M
��O[b|
��:7 >���k��0�6�ac���3��e�D�������r �KF4?��^�����X���9�����n��;M
��N��w �c�)rt ���}� �K}����>[W��������'���
��~�m�A�t�v�+��&S��Ix_�wpd����y�y�5oq�k�������X[$����x?��_����K��N"n�H�x[,R#�I�~��%o�;%M����
��i*�5��1�kj�qM-�=�`
�4�6�8�kO��������=��I3�b@���E�
22`m
��� x���m
��CSK�� ��w��� x[����A| ���j���^5s�+������ �S}��<��!������'�i��6y27�C��U�=y��iJ����6k'.�ZY��dZ����K��Bl�� �m�6�x[���x����m��.?�q����x�U�k�)�<��!��5����)��_��Z�e#{��OT�������0��|w�������WU���������g&u���J"=�?�������[����4?)�w��Q�w"���by���9��D���\�����v��}���������d���D���2u4Ge�Lo; f�|%Ge�[��9Ge�k�J�a��F:��\�s��7�J�f��_�T]�;-�^Ssp����R�����9�{*y�>��s��}��������P�3��J!�9��6���Q����3�����6"�n�J������9�f�,�fC�<;3�m�J!�1��]��_��s���|�TZ�D0�C�1�����|�d����^��������v�t��J�I0�iB]������5j�����p��b��������1�|��b�E���=b��]����������m)���[��!�a�]62`���� ����I��h��[�EW����M^�lw��xu�PG����"����������G�m�-V��QJB�
��I��L]���G1�����C��w���V8�S�~���F���#9�t��:��v� ��t��f� �v��m;����������\���RF��)i_y�n8�/�[M}��y���$R3-@"�V�Z��5E������~n��A�O�j������`BJ�07L�1)55L����4 �}��-��rs��)bn�v��su���G�3wK����]�f�+�����
� ��0o�n�X�kY&���<���YLjM����'����{2 �9�vB�=��-�'75���1jw1��bn�� 1[^�O��������Mn�z�����6I��Q�r\v��-����aw��1*g9����0o�br<�ky����r����a�Q���h�r\�����x�[?|w��%s�����E��}r����8Lju�j�+�abk���b��kP����]�����5���g�j��� e����x`m�mm��� x���4^:���m�O��l_�[o��-����1��!k�([��5w�w��`D�M�Dlu�[o�}q�p����B"��CD�(�a�X�����y����������VV3���Nw���R�d�2>>��z�����Zd(Q����Z1���{���;�����{u@s���q�\�g����|w[�p��g�RO���`�������2>��_��V(�m����;�����
X��4 ~4���]PSJ��t� ��
��MJ���-;��K�&)�6H�Myw�}�~v�?�\����"��[�7rd���O��:�_� ��Z6���Jyj���EJm��R*���TV �t�-y6*�u��������J�@��
^&B���d��no� �h �$ S�T-0�)��������
endstream
endobj
20
0
obj
5491
endobj
21
0
obj
[
]
endobj
22
0
obj
<<
/Type
/Page
/Parent
1
0
R
/MediaBox
[
0
0
842
595
]
/Contents
23
0
R
/Resources
24
0
R
/Annots
26
0
R
/Group
<<
/S
/Transparency
/CS
/DeviceRGB
>>
>>
endobj
23
0
obj
<<
/Filter
/FlateDecode
/Length
25
0
R
>>
stream
x��]M�$�q��Q7�����X��;�WC�����������,��;�2#A>V�0�4XL?����� �df����5������0���������<^>����y�����?>Rpm@|��m3���2-��m��%�m�fU�����?����0������O��p�_Zk�m�t�����"��������|z��}�D���%�;���������L����QI^�C�h��;$.��kN���izI�bn_�8�����p7h���I\������0j���I\������4i���I\�����0�A�>�G{$q����n�������iSL��H]�$v%��*�S�+�<*�G�e�e������Jb�Q"�����~�h��8-S�%��0
s3�������^����Ern��r@�������37��������7O����A3��n?������j��95�f���8�N������j��}S����C����\9:�y`��:��Nj.K���������?JY� 0Jo�"������5�$Ks��0��X�'����,����x�a�=~�C��'�;�!�6�S�k�'T��'�%=N���>���%�c�y[�9��0)�a�����J�8QB��IH������Dy�����S,E)����I{��G;�0n�T�@�5$f-w`���)r�@���t����U_�,a� k�~u�y1Z��dQ��Y�%�P�'��!���E�^�w��G�0>��K�*��f�{`���)��2"���#�U������J�X����
��0���E�^[t/��O�e%%�M,�u�_=��)�[�7:���'��o�PR�8&�0��b�(C��GH�z�T �
�Oqn���77������N[��Y�$)o���$�^6�[� k�8h���j���J�T�}����g��-u��V��L�k�L6��>���������VyU����|���P�&�,���8���>�R�0��Z=K��������j�����eQ��o�I�������Q�9Q�0�P�k�����������.@2�_[��������)���yn$u6H�]�*aR��)_�`����g]Oex������A�:�W� k�X��9,���X�0�P������S;^2
{��%3,������YZK�l���e������QQ�Wz��e ��Oq�:vs9����q����)���8)`Y�S��}�k���w����n���U��)��7f�t��������.�2��r����E�[�r�E�}��I�����. w�~�_�{] ��!;[���a���t��a-���X�q���5 O�����l�E�m�flV������X�����k���C�^�(����QW����~��^�GU����SF]F�/������@zt���?�$��z/�Zw�n���Q����?V��)����5�"*;�c��
~�O1��5��2�n����Y�h��D��"&w�.=Y��Q$@�E��nM
�������E��,{`�9��8��1Zb��%#��C��w,z���4��b���$�5��Y����m��j�F�_�a�E�u�E�Zt����v��,��.wY�����Y��.ww,�ty��N�1�\�``Q�O��Q$`�E��A�������wY���X���B3��]��s�a<�b��z{�K`V%��0NT�� �([�+`��)���nJ�d��k�{�N����o0d���Y�p���X���K��l�.���U��T*�����g� �q�x�{gS�;�'vL���z�� NT�7��&H�-��w� ��w,�����w,��[7�0�U��W�fY�q_c-{;j�-]�.����qQ���a�:A`a��2���E�l6����.��J�X�^���,]�� �A��_���
����q�U�8�*8�C�T�q�0fY��.�3�����l�.�Y����k���������.K3<��w���3�#7']�F����j�;�X���t�:]�f|�+�z�����f�u��bj�ns�����*4�R��� �r���4��U����X�{O��S���'����S~�`F��
q���^W�n��ja�D���e ��&�pQ��
>����H�z]f�-�|�S�� ,��6d�}��{v
����f�{�Q���p��r���4������+�8eM��.F3�5o�(w�>��c���D�S<�=���(f-w�x8�Q��NX�{P�Q&`<�.����r�bx�;|'����i���������a<7���YW��J8n�a����������|Y���x�7����AW�;�����u�O1��&��ei��6�.��Z�e+u�e�R6�g�]������e�5 6C��;�����p�������([�+�(0����J�;��eU���_3�|��v��^��;���|�p�����cX�
��lT '�7���a��eU���a3,TK�����;]�fx,�������_Y�����k��%e�t�:/�Z�pAt��q������*�g]��8:���#��}�zH�?�r%?g]�F� q"z��Y4��u���#���]Gj�F�;�v���p����K��v��2���;�_X���5���2���e�2]S�XX<�2��.��,�[��dA���3,���e]�SW�������E�btm��X�,��%3<����\���a���t)�a%[��_��F��aU�E]YF���Vm/~J�-_�&T���dU���%3|�����2��r��gic���x�VZ�=:��2����:�'Y�-���Y���?^��^��,�]�K�(0��?��G��Uv��kKV%��0��m�������+���SL��;�j�.����;f��}��j,�n����[V=pT k�y�����+
P��*w�.oC���8qX�P�BFlQqGQ�� ,:�v��}�U������tmE��s�J����L�+�r?��:������������fj�b�z�C�t��ca.o�uw� �y\����7��{�+�e��<}m'Y����w�{�����;7�����|n�i����}s��*�����mo-�@������t��i�t����C��Ow���� ?�!:����t����|o�$?�����_��~����c��m��e�Bc'Z��_�������?���|yJ����������S��5��|�|���:������?;����_��tjB��}�:d�Ma�!����OOU�m�w���J����_�2,-f�|z=����5���?�������f����G�����������b�w����:��s]���9�o�QW��1)��w��������>�W����g[�A��������)��"R�D�jcz���O�*M� Y�8�A�@j
@��F���� ��k/�^��A�o��a��w'>�M���A 2o��S����m��������T����]c�x�~<�������<�j���;����s�t���$��u;����6m�c�����S�=�-��y�U��%�y?W����#�1�s}�k������� �3�m��!.�]��MS�O{�]Nm�"�?;�����
��� ���^*^�s�,�����Z�m��
o�����u|�0�d#���� �s��Kf#x��+�s6*x\�"��,�Q��J�n�b�r�+~�/�b��O�e�f �����s[�A����b���+"4������X��%�x�Y$5�k�b
4�g����\;����j��o�~��V0���E~��O�K1��k�Z�� ����9�A^���-�)M��fX��%^�����^�s��/2���2��,3P��^�e����|� �|Cl������b6����� o�z�\)[�����M� ��J��^`���d8�������^�GC
��2��!�BE3�[�f��0^���$ om�9��k�Op�!f��C�����"���%>b��������+�Q;�������9��O��];t�uf{�!����c(�:t��8�bR� �X]b�����?x��^�X.��j�ex��=b2{���� �y����|�'`v <��\2� q���<W/d���!{�MC���6B���gND��+�
<q�����H������E��3��!M ���i2M b�4���L3�����w�(��86<�g�lH@1M@���L
��|�0`�/�D����CM2� ��e��v�<��2�
f�W�0����1��.En���\�=�\��P�����q���A�z���~�����}�|�����a�
?9���� D�G��y�����
���%� o�)1x���2n�W�J^�Q_3d�A3>���1�����A��\=/���������~���~���O[ ������y=]��LOW B�r;c=��X��WbE������X}���+���W��Q_#d�,W�"�>�����+�
��t��z��F��8�W����]l�
��:�����K�|
��eSL>��%� ^���H>#����a�|\-��4a^o��D� �L���.�|C�����WX�-m|�}{C��
$5�� �Kj+���U��~��
�~�8��k�Lm��"�,��T��
px�_������\���`�\_F�|x?Y��2��'c�A����kM?��#� ^C����"����3�������?��� ��r�tf�b�YQ.��<9<2��=r���Eb9l��c�G(��r��o�!V������H�7����A._#d*8a.�JV��?�� /�c���=j/�k|��0>��'bv�9��s��p�B���2�\��1W����ks��-gJ8���k����sy���0�y��.��u) �����F� vuBq����i���7���Fx��Wl8��Q'2�G�!^�m�����N%#���"�JVTf�~j����qR3����v��}
�2a��@OP�<�f��S����C5b��oco��m����� _#d�puB�
_+x�]�7�+�z+�����Z�qx��O��u��u��'&��D�����������fXO~�����i�\b�>t���l|c�8=�����<�V�h�.So��������r�� 9�b�A .w�� �8�B^�����K�����]�]��q��r�8F\���Yp�>k-Xa�������;s�:�\_1ML
>�?4���1=+��
gl��Lb�x3)t���j���Q ���[vj�|���c�6}Mi��r��<��e��t�{a����K���0 f��;�k��NS���}���J��� o��/r�}�x�"~��[�����A���"6�c����o���p�Af�+��q"1�NO$>������������W0bR�<���� ��<��z<|���_�h~2��c��\M�9���=�A�-/���0�T��&���;(0��,�ZgY���D�� ��,��/6�x_�'2b�����c��\M���r�Y�@�e���h<b��;1��{l�+�/wc���n�.�Cc��&�����o�!V��x_��(2���ex�7�z!�O���{����1I@/\� ����0����s�����~=!�}��p��_M��s��z![@8��fX��`/,�3m�a�g�{��LI���gJ��/����a�z,O �����d4���������2�f���a�51�� "f�a�Z���QG3��0
x�{�"b��M^�����W0��M���P����l����K�����/�xjb����L�2�t*r���=����~�N& @�jDE�p�B& x���F�{�_{��Gg�3`vI��x�s�:���]Fg���c��^���W����!T4���z����q��ab��;�K���"bR� �X=J\�[?n�X��OON��0�%3j�t����_,��X�^x�� ��1�b^� ����
���cE�.�("~��[F��2�
f����fX����
��C(����N�+�B�ay�C�nw�X:�s��8��@�[����7�_�~�zF~���p�e:���D�����7&�o��<?~�x_�|l����}�EZ.����sm�����c��LT�<c����"��7��dx?Z_����:6��]��/oL>��61� ^�S��F���K,�O���| ��������0#�%��91��?9�
�~�.I� Uh�JGLj�J!V�w��x����[j1�!�gC��l�k��f��1���k+\�f�����Ja^�Sf4�B�S)���������W����h�H>��%�x����kN>���>�D�q5B&_b�|
�����9<��y
/���0�$��1�B���G,x�!?ifz�:-x!j��buIh^���}�_4y���w
��^�����y�1��V?� �G���_��?��VW
endstream
endobj
25
0
obj
6415
endobj
26
0
obj
[
]
endobj
27
0
obj
<<
/Type
/Page
/Parent
1
0
R
/MediaBox
[
0
0
842
595
]
/Contents
28
0
R
/Resources
29
0
R
/Annots
31
0
R
/Group
<<
/S
/Transparency
/CS
/DeviceRGB
>>
>>
endobj
28
0
obj
<<
/Filter
/FlateDecode
/Length
30
0
R
>>
stream
x��]M�$�q��P7������}P;?���U��������{��I�,�2$��w~1�|��fD�`1��X�e3�2���s����??�~�0���������������������c�$�� �-��{��i��s��3K������7m�?���g8����~L���]�7������������E�����m7���� ����O�f5�O�}�����*H��.��$�;/$r�_�p}+$6�$�e�c{$��1}e��AP�x��
��s�_����v����R"�{�K���"����9vW
�}��H��K���/t�����D��ihE��8�G}�B"���n���}�o����v��8�zl����%�>$e����9�R�X��v��x�$e����$�IJD�OM�9��/Hl_y;RF�7R"�M���W�����4��q�����)#���x����W��������-��v��qh���qlAb����2�~�9��/�k�i����k�o���4����:����O|��i�$�������������,G���5�����;���\_�q}�;/T�4�F<�8���m�i��q5"���n�#y�[_$��I&`,d����i��l���������Bv{D�l�Q&��q���M�vyr�����$�;N�������8U�7��9'i^� Y�z��L�r�^�w��y�I�74��9�w���/.9
�$MA�x�G����9���h�&I-�����N8�F�N����3Q�n�i���V����<
�NA�w��l�s����,]8fO�|!�2��L�p��Y����s�6��p%���x ���� �]�[m��z� 3p%����Db��;eVe���h�V����f��As�2q��m�!�8��Ae"�D�B�������N������x��Is�ey�B�L�I��4�w��9:n��`���i�!ku�����%���P��5r��g�w�;R��;i��{�u1������6��}�}�u��b�6
I+G]����&�������7����-�FEb�������}jK8�Ln�/����s�G{n'3'��W�j�Ru������T�6�< ���E'FY�4�
$� q&��>��u��D/�=���Ieo�������
�����m�At��Y��4�����&��k���X�&������Ll�c!{m0�~D��h�$0���9'�R7����s,
|=RG��c{��
������eC�=�nI7x������X���,]���q��������� Y�zt}�������AtnT���+a������s���l�K}s��a����1DnC���.%������H��X�N����4%�f��Q�zPd�L%`,T���~�[�c��(J(�I$���������O��^�gv�H�8��iyL���t$K6i9� K�����@��Q��*W^Ieo�c�zX��D�cf�N��9n��s�$�d�D�R��5�$���!�L4g��{�6q ����d�9�sp'�a
�ln����J&�IxR U�y�xm�#��IeN2�}�;���:'����<�e�}�N��ivV�P��d&�e[�I&`,d}JK�b������.b��"�Jl�cwL�/[w��x7�D��d�}�cM"N�dL��'���P���?���v;s�f�]��J2�}�/l��}d��M2���d���D�}�='�rw��oR K�����l+���rIfo��.���S?63!c�\x�����}���m#��-e&�D�Bt��:�� �`$+�$�b���d���c ���9�Rf�T;������c��d��2M2c){��\��Y�@�v�0����=��c� ���|�e�y��h�����bW�t���H&�������d:J��2�<`){k�f�}V���2�<`i�+1�:J&�=U{�x�I�_*���"�>+(!��>�L�=�e��d�d�R����>�|����~��Q������gI�"
�v�:��}��}��I��)��1-�'e��K�
���>����9/�R/��7����_&�R������������\+S��Y����"�b�Q�RV���8+<�������;�e~}���R��7���I6�^f�,do�K�Uu�����2�>��&�v���]V+��_p/R�}�u�E��d�M�2�>`!s>1�Ll�&��l�e5mn����[e3��+�n����J�����<���x��T���h3�l�l�^&�,e��Z�=6rgE}.+�s^&�,TK�������y�r� S��"�M��i>�L�e���
��0�|=#-�3\V����Fd�}�B���8�$���i�b�k�r�J�IxR KU�2�S[������j:]V�9�
>�]8�������I���$|H�l�
O�G�7`�uf^�%s�<<�����kFCV�:d�C�d�X����(Mf�C?H#�]/�{����[�5�&��8�F�6�0��W#�d2'�����L��E�a� ���u�L�x9���t�J���pR U^RA?��
�%k��x�2>`&{[��2S�0�n���{���V8G�dL��&����@{e�T��2Q�LLX�*�i����{j2��n8�6t�io�����-��Q2YM����N&�9�d��]c_�l�t�a{�s�Bm���dq���N&����3I5x��c'3����?���
�w��2_$���[l��D����i`�H����>����\|d�l�Rv���X��P�&�F��O���9f�7�u�c+�;���qc&���x�6��P�����<�G�s<��8��Qf�$��w��s������L�6�n6�O� ���[D�;9��0�M�wi�y��hy���s�t�����oS�h�dl�L6-�b<������'9I�pRM';�c������1t�L��c�7�E���:ur��0��r������ed��p�Ht�0���P�~�d���;�2�pQ?L[`!U:�)�}R KU�h�#*]��f�d�=`��h/]�`R�O���9������|�(�1��':g���ql��(�T���&��}�3����KV��uI'@���8G�<��af�����9n�
�*��\6#��>��\���ss�v��/�u��?�?��9�QL����Y�:���+$��i���O����n�>uM�?� }:4��A�tl����b�����]���U(�}�I\�eOr�l�ilDs�����������W��������I��el�C�c���i������<�������~{��_�_�r�����N]������:}��
���*���/�~�y�x�fNl�a����v������~�@�q&�PG���-��0_���gu�n~��T`~>}��]\���4���� b?G����;}_E�~�:1`5q����������M
���@\y�����j3'�����N�M��~9}�����5����2�\��n� ���������!���+]�|�7E���C�3L>�_���s���<���u�<��9�L�����R�r<�U���Z=*CV�Q�UG���z��#��!x�0��S�N��<|�7��R=Qe�����<;�nL�S�;������
��d�~������{\�X�q��c�;���w�_�>�_����� �s���q2Of��a��8q�W��B���i*sbSCp'6�wb@�L�_�Ny
�Y��<�Q;1�}��n����N�l�s^���G����!���Z��\���*�}�6!o���<��{\����m8l~3{����~}�����~u��J'\���o���u�>�#B��[�n��C�?��u�����V�����6G�s�.��,p��Y"VND���u��&+��5�g��Ku�K!�>��ZRHF�O���x��t��`V�|lyS�)��������cj|l�Y�)�� VL`��w��/��L?.��&D��6;Y_��D�u�����j"����F� o��0����~M���U��d^U�U���6oD����B���yh3�0mbmE$�P�i^yM�U��"�E�&��~���x�m�!>��$�x�����>�WS������y�1�0>b}��&��>�'_)� f��m
>���`e�~�����_�X�U�<�� �Z����/�j�
��D�~�()�5�;b}>���|!^�Zl�l��x�k�w��{�q"f�+�j}��i���\�X�s%ST!�C���UO����E����p����70�v�D�&N\��f������!���vD�9����W��\��Wg1�_�#�p�JK�l�q^��^�"-[Cp�3��8@�9�@x-V�0�>�E���������j5�X(�RYDm0"V��P�UO������<>����`
������0[qb���z��p�"N�l�s^������Z����!�����J��\�O>xm�����A^�����e����jN�j��%^��������`N\"V������0�}�O:����z�}����-�c>�M|�����"^�d����
���*"�8q�y��"f�}��>�����m;�`t����= B�lD[��X-*��yS`+����l(S�L���2���@����X��<F����������X�G�����t��<�lzlyS�)�j#�M�)\�>��Z����z�����Hc�$���� f��x��\c�"��v"�X-J�K��hV���� ^�Rc[C�hfL��Y�X����?(���b;ju$��Z=���y���W�!^����=�����E"�l1�2�M����M����+[C�HdK�"Q�X� �����9�gV%f�}s})$?��>���/���_�c��h�Q����IH+��C�5��5�<b�b"~T,����ZTo"�g�� x����v���7���x��3���R����
��5��)���B�������-�����z��9��-��T=hV.��W$ �����!����u@lQ��x-^.�x�������������
���+����x��T�k�l
�<�D���"bMeV�8�k1h"^�C�_����#CK�4c1�4Jc�~DF�&��������7�����>>