Adding skip scan (including MDAM style range skip scan) to nbtree
Attached is a POC patch that adds skip scan to nbtree. The patch
teaches nbtree index scans to efficiently use a composite index on
'(a, b)' for queries with a predicate such as "WHERE b = 5". This is
feasible in cases where the total number of distinct values in the
column 'a' is reasonably small (think tens or hundreds, perhaps even
thousands for very large composite indexes).
In effect, a skip scan treats this composite index on '(a, b)' as if
it were a series of subindexes -- one subindex per distinct value in
'a'. We can exhaustively "search every subindex" using an index qual
that behaves just like "WHERE a = ANY(<every possible 'a' value>) AND
b = 5" would behave.
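To make the "series of subindexes" idea concrete, here is a minimal, self-contained sketch in C. It is not code from the patch: a sorted array stands in for the index on '(a, b)', a binary search stands in for an index descent, and the names IndexEntry, first_at_or_after and skip_scan_count are hypothetical.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for an index tuple on (a, b) */
typedef struct IndexEntry
{
    int a;
    int b;
} IndexEntry;

/*
 * Return the offset of the first entry >= (a, b) in key order, or n when
 * there is no such entry.  Plays the role of one index descent.
 */
static size_t
first_at_or_after(const IndexEntry *idx, size_t n, int a, int b)
{
    size_t lo = 0;
    size_t hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (idx[mid].a < a || (idx[mid].a == a && idx[mid].b < b))
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/*
 * Count entries with b == target by visiting one "subindex" per distinct
 * 'a' value.  *ndescents reports how many simulated descents were used.
 */
size_t
skip_scan_count(const IndexEntry *idx, size_t n, int target,
                size_t *ndescents)
{
    size_t pos = 0;
    size_t matches = 0;

    *ndescents = 0;
    while (pos < n)
    {
        int cur_a = idx[pos].a;

        /* Jump straight to (cur_a, target) within this subindex */
        pos = first_at_or_after(idx, n, cur_a, target);
        (*ndescents)++;
        while (pos < n && idx[pos].a == cur_a && idx[pos].b == target)
        {
            matches++;
            pos++;
        }

        /* Skip the rest of this subindex: find the first a > cur_a */
        if (cur_a == INT_MAX)
            break;              /* no larger 'a' value can exist */
        pos = first_at_or_after(idx, n, cur_a + 1, INT_MIN);
        (*ndescents)++;
    }
    return matches;
}
```

The sketch uses two searches per distinct 'a' value purely for clarity; it says nothing about how many descents the patch itself performs. The point is only that the work grows with the number of distinct 'a' values rather than with the size of the index.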
This approach might be much less efficient than an index scan that can
use an index on 'b' alone, but skip scanning can still be orders of
magnitude faster than a sequential scan. The user may very well not
have a dedicated index on 'b' alone, for whatever reason.
Note that the patch doesn't just target these simpler "skip leading
index column omitted from the predicate" cases. It's far more general
than that -- skipping attributes (or what the patch refers to as skip
arrays) can be freely mixed with SAOPs/conventional arrays, in any
order you can think of. They can also be combined with inequalities to
form range skip arrays.
This patch is a direct follow-up to the Postgres 17 work that became
commit 5bf748b8. Making everything work well together is an important
design goal here. I'll talk more about that further down, and will
show a benchmark example query that'll give a good general sense of
the value of the patch with these more complicated cases.
A note on terminology
=====================
The terminology in this area has certain baggage. Many of us will
recall the patch that implemented loose index scan. That patch also
dubbed itself "skip scan", but that doesn't seem right to me (it's at
odds with how other RDBMSs describe features in this area). I would
like to address the issues with the terminology in this area now, to
avoid creating further confusion.
When I use the term "skip scan", I'm referring to a feature that's
comparable to the skip scan features from Oracle and MySQL 8.0+. This
*isn't* at all comparable to the feature that MySQL calls "loose index
scan" -- don't confuse the two features.
Loose index scan is a far more specialized technique than skip scan.
It only applies within special scans that feed into a DISTINCT group
aggregate. Whereas my skip scan patch isn't like that at all -- it's
much more general. With my patch, nbtree has exactly the same contract
with the executor/core code as before. There are no new index paths
generated by the optimizer to make skip scan work, even. Skip scan
isn't particularly aimed at improving group aggregates (though the
benchmark I'll show happens to involve a group aggregate, simply
because the technique works best with large and expensive index
scans).
My patch is an additive thing, that speeds up what we'd currently
refer to as full index scans (as well as range index scans that
currently do a "full scan" of a range/subset of an index). These index
paths/index scans can no longer really be called "full index scans",
of course, but they're still logically the same index paths as before.
MDAM and skip scan
==================
As I touched on already, the patch actually implements a couple of
related optimizations. "Skip scan" might be considered one out of the
several optimizations from the 1995 paper "Efficient Search of
Multidimensional B-Trees" [1] -- the paper describes skip scan under
its "Missing Key Predicate" subsection. I collectively refer to the
optimizations from the paper as the "MDAM techniques".
Alternatively, you could define these MDAM techniques as each
implementing some particular flavor of skip scan, since they all do
rather similar things under the hood. In fact, that's how I've chosen
to describe things in my patch: it talks about skip scan, and about
range skip scan, which is considered a minor variant of skip scan.
(Note that the term "skip scan" is never used in the MDAM paper.)
MDAM is short for "multidimensional access method". In the context of
the paper, "dimension" refers to dimensions in a decision support
system. These dimensions are represented by low cardinality columns,
each of which appears in a large composite B-Tree index. The emphasis
in the paper (and for my patch) is DSS and data warehousing; OLTP apps
typically won't benefit as much.
Note: Loose index scan *isn't* described by the paper at all. I also
wouldn't classify loose index scan as one of the MDAM techniques. I
think of it as being in a totally different category, due to the way
that it applies semantic information. No MDAM technique will ever
apply high-level semantic information about what is truly required by
the plan tree, one level up. And so my patch simply teaches nbtree to
find the most efficient way of navigating through an index, based
solely on information that is readily available to the scan. The same
principles apply to all of the other MDAM techniques; they're
basically all just another flavor of skip scan (that do some kind of
clever preprocessing/transformation that enables reducing the scan to
a series of disjunctive accesses, and that could be implemented using
the new abstraction I'm calling skip arrays).
The paper more or less just applies one core idea, again and again.
It's surprising how far that one idea can take you. But it is still
just one core idea (don't overlook that point).
Range skip scan
---------------
To me, the most interesting MDAM technique is probably one that I
refer to as "range skip scan" in the patch. This is the technique that
the paper introduces first, in its "Intervening Range Predicates"
subsection. The best way of explaining it is through an example (you
could also just read the paper, which has an example of its own).
Imagine a table with just one index: a composite index on "(pdate,
customer_id)". Further suppose we have a query such as:
SELECT * FROM payments
WHERE pdate BETWEEN '2024-01-01' AND '2024-01-30'
  AND customer_id = 5;
-- both index columns (pdate and customer_id) appear in predicate
The patch effectively makes the nbtree code execute the index scan as
if the query had been written like this instead:
SELECT * FROM payments WHERE pdate = ANY ('2024-01-01', '2024-01-02',
..., '2024-01-30') AND customer_id = 5;
The use of a "range skip array" within nbtree allows the scan to skip
when that makes sense, locating the next date with customer_id = 5
each time (we might skip over many irrelevant leaf pages each time).
The scan must also *avoid* skipping when it *doesn't* make sense.
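A range skip array can be pictured as an array whose elements are never materialized, only bounded. The following is a minimal sketch in C under that assumption; it is not code from the patch. The type RangeSkipArray and the functions range_skip_start, range_skip_next and range_skip_advance_to are hypothetical, and dates are modelled as plain day numbers.

```c
#include <stdbool.h>

/* Hypothetical range skip array over an integer-like column */
typedef struct RangeSkipArray
{
    int low;                    /* inclusive lower bound from the qual */
    int high;                   /* inclusive upper bound from the qual */
    int cur;                    /* current array element */
} RangeSkipArray;

/* Position the array on its first element */
void
range_skip_start(RangeSkipArray *arr, int low, int high)
{
    arr->low = low;
    arr->high = high;
    arr->cur = low;
}

/* Step to the next possible value; returns false once exhausted */
bool
range_skip_next(RangeSkipArray *arr)
{
    if (arr->cur >= arr->high)
        return false;
    arr->cur++;
    return true;
}

/*
 * Jump over a gap: the scan landed on a tuple whose column value is
 * tupval, so every element before tupval has no matches.  Returns false
 * when tupval lies beyond the upper bound (the array is exhausted).
 */
bool
range_skip_advance_to(RangeSkipArray *arr, int tupval)
{
    if (tupval > arr->high)
        return false;
    if (tupval > arr->cur)
        arr->cur = tupval;
    return true;
}
```

With bounds of 1 and 30 (standing in for '2024-01-01' through '2024-01-30'), range_skip_next enumerates the same 30 elements that the rewritten "pdate = ANY (...)" qual lists, while range_skip_advance_to shows how a scan could pass over dates that have no tuples at all.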
As always (since commit 5bf748b8 went in), whether and to what extent
we skip using array keys depends in large part on the physical
characteristics of the index at runtime. If the tuples that we need to
return are all clustered closely together, across only a handful of
leaf pages, then we shouldn't be skipping at all. When skipping makes
sense, we should skip constantly.
I'll discuss the trade-offs in this area a little more below, under "Design".
Using multiple MDAM techniques within the same index scan (includes benchmark)
------------------------------------------------------------------------------
I recreated the data in the MDAM paper's "sales" table by making
inferences from the paper. It's very roughly the same data set as the
paper (close enough to get the general idea across). The table size is
about 51GB, and the index is about 25GB (most of the attributes from
the table are used as index columns). There is nothing special about
this data set -- I just thought it would be cool to "recreate" the
queries from the paper, as best I could. I thought that this approach
might make my points about the design easier to follow.
The index we'll be using for this can be created via: "create index
mdam_idx on sales_mdam_paper(dept, sdate, item_class, store)". Again,
this is per the paper. It's also the order that the columns appear in
every WHERE clause in every query from the paper.
(That said, the particular column order from the index definition
mostly doesn't matter. Every index column is a low cardinality column,
so unless the order used completely obviates the need to skip a column
that would otherwise need to be skipped, such as "dept", the effect on
query execution time from varying column order is in the noise.
Obviously that's very much not how users are used to thinking about
composite indexes.)
The MDAM paper has numerous example queries, each of which builds on
the last, adding one more complication each time -- each of which is
addressed by another MDAM technique. The query I'll focus on here is
an example query that's towards the end of the paper, and so combines
multiple techniques together -- it's the query that appears in the "IN
Lists" subsection:
select
  dept,
  sdate,
  item_class,
  store,
  sum(total_sales)
from
  sales_mdam_paper
where
  -- omitted: leading "dept" column from composite index
  sdate between '1995-06-01' and '1995-06-30'
  and item_class in (20, 35, 50)
  and store in (200, 250)
group by dept, sdate, item_class, store
order by dept, sdate, item_class, store;
On HEAD, when we run this query we either get a sequential scan (which
is very slow) or a full index scan (which is almost as slow). Whereas
with the patch, nbtree will execute the query as a succession of a few
thousand very selective primitive index scans (which usually only scan
one leaf page, though some may scan two neighboring leaf pages).
Results: The full index scan on HEAD takes about 32 seconds. With the
patch, the query takes just under 52ms to execute. That works out to
be about 630x faster with the patch.
See the attached SQL file for full details. It provides all you'll
need to recreate this test result with the patch.
Nobody would put up with such an inefficient full index scan in the
first place, so the behavior on HEAD is not really a sensible baseline
-- 630x isn't very meaningful. I could have come up with a case that
showed an even larger improvement if I'd felt like it, but that
wouldn't have proven anything.
The important point is that the patch makes a huge composite index
like the one I've built for this actually make sense, for the first
time. So we're not so much making something faster as enabling a whole
new approach to indexing -- particularly for data warehousing use
cases. The way that Postgres DBAs choose which indexes they'll need to
create is likely to be significantly changed by this optimization.
I'll break down how this is possible. This query makes use of 3
separate MDAM techniques:
1. A "simple" skip scan (on "dept").
2. A "range" skip scan (on "sdate").
3. The pair of IN() lists/SAOPs on item_class and on store. (Nothing
new here, except that nbtree needs these regular SAOP arrays to roll
over the higher-order skip arrays to trigger moving on to the next
dept/date.)
Internally, we're just doing a series of several thousand distinct
non-overlapping accesses, in index key space order (so as to preserve
the appearance of one continuous index scan). These accesses start
out like this:
dept=INT_MIN, date='1995-06-01', item_class=20, store=200
(Here _bt_advance_array_keys discovers that the actual lowest dept
is 1, not INT_MIN)
dept=1, date='1995-06-01', item_class=20, store=200
dept=1, date='1995-06-01', item_class=20, store=250
dept=1, date='1995-06-01', item_class=35, store=200
dept=1, date='1995-06-01', item_class=35, store=250
...
(Side note: as I mentioned, the two "store" values usually
appear together on the same leaf page in practice. Arguably I should
have shown 2 lines/accesses here (for "dept=1"), rather than showing
4. The 4 "dept=1" lines shown required only 2 primitive index
scans/index descents/leaf page reads. Disjunctive accesses don't
necessarily map 1:1 with primitive/physical index scans.)
About another ten thousand similar accesses occur (omitted for
brevity). Execution of the scan within nbtree finally ends with these
primitive index scans/accesses:
...
dept=100, date='1995-06-30', item_class=50, store=250
dept=101, date='1995-06-01', item_class=20, store=200
STOP
There is no "dept=101" entry in the index (the highest department in
the index happens to be 100). The index scan therefore terminates at
this point, having run out of leaf pages to scan (we've reached the
rightmost point of the rightmost leaf page, as the scan attempts to
locate non-existent dept=101 tuples).
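The order of the accesses listed above is that of an odometer: the lowest-order array advances first, and rolling over carries into the next higher-order array. Here is a minimal sketch in C of that ordering alone. It is not the patch's _bt_advance_array_keys: the names ArrayKey, advance_array_keys and count_accesses are hypothetical, and every array is materialized in memory here, whereas the patch's skip arrays generate their elements procedurally.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical array key: a fixed list of elements plus a cursor */
typedef struct ArrayKey
{
    const int *elems;           /* elements in index key order */
    size_t nelems;              /* must be at least 1 */
    size_t cur;                 /* offset of the current element */
} ArrayKey;

/*
 * Advance to the next combination in key space order.  The last
 * (lowest-order) array moves first; when an array rolls over it resets
 * to its first element and carries into the next higher-order array.
 * Returns false once the highest-order array itself rolls over.
 */
bool
advance_array_keys(ArrayKey *keys, size_t nkeys)
{
    for (size_t i = nkeys; i-- > 0;)
    {
        if (++keys[i].cur < keys[i].nelems)
            return true;
        keys[i].cur = 0;        /* roll over, then carry */
    }
    return false;
}

/* Count every combination, starting from the first */
size_t
count_accesses(ArrayKey *keys, size_t nkeys)
{
    size_t n = 1;

    for (size_t i = 0; i < nkeys; i++)
        keys[i].cur = 0;
    while (advance_array_keys(keys, nkeys))
        n++;
    return n;
}
```

With arrays of 2 depts, 2 dates, 3 item classes and 2 stores, this enumerates 2 * 2 * 3 * 2 = 24 combinations in the same order as the listing above: store changes fastest, dept slowest. As the side note says, several of these logical accesses may be satisfied by a single primitive index scan.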
Design
======
Since index scans with skip arrays work just like index scans with
regular arrays (as of Postgres 17), naturally, there are no special
restrictions. Associated optimizer index paths have path keys, and so
could (just for example) appear in a merge join, or feed into a group
aggregate, while avoiding a sort node. Index scans that skip could
also feed into a relocatable cursor.
As I mentioned already, the patch adds a skipping mechanism that is
purely an additive thing. I think that this will turn out to be an
important enabler of using the optimizations, even when there's much
uncertainty about how much they'll actually help at runtime.
Optimizer
---------
We make a broad assumption that skipping is always to our advantage
during nbtree preprocessing -- preprocessing generates as many skip
arrays as could possibly make sense based on static rules (rules that
don't apply any kind of information about data distribution). Of
course, skipping isn't necessarily the best approach in all cases, but
that's okay. We only actually skip when physical index characteristics
show that it makes sense. The real decisions about skipping are all
made dynamically.
That approach seems far more practicable than preempting the problem
during planning or during nbtree preprocessing. It seems like it'd be
very hard to model the costs statistically. We need revisions to
btcostestimate, of course, but the less we can rely on btcostestimate
the better. As I said, there are no new index paths generated by the
optimizer for any of this.
What do you call an index scan where 90% of all index tuples are 1 of
only 3 distinct values, while the remaining 10% of index tuples are
all perfectly unique in respect of a leading column? Clearly the best
strategy when skipping using the leading column is to "use skip scan for
90% of the index, and use a conventional range scan for the remaining
10%". Skipping generally makes sense, but we legitimately need to vary
our strategy *during the same index scan*. It makes no sense to think
of skip scan as a discrete sort of index scan.
I have yet to prove that always having the option of skipping (even
when it's very unlikely to help) really does "come for free" -- for
now I'm just asserting that that's possible. I'll need proof. I expect
to hear some principled skepticism on this point. It's probably not
quite there in this v1 of the patch -- there'll be some regressions (I
haven't looked very carefully just yet). However, we seem to already
be quite close to avoiding regressions from excessive/useless
skipping.
Extensible infrastructure/support functions
-------------------------------------------
Currently, the patch only supports skip scan for a subset of all
opclasses -- those that have the required support function #6, or
"skip support" function. This provides the opclass with (among other
things) a way to increment the current skip array value (or to
decrement it, in the case of backward scans). In practice we only have
this for a handful of discrete integer (and integer-ish) types. For
example, the patch currently cannot skip an index column that happens
to be text. That said, even this v1 supports skip scans on indexes
that include columns of unsupported types, provided that the input
opclass of each column we'll actually need to skip has support.
The patch should be able to support every type/opclass as a target for
skipping, regardless of whether an opclass support function happened
to be available. That could work by teaching the nbtree code to have
explicit probes for the next skip array value in the index, only then
combining that new value with the qual from the input scan keys/query.
I've put that off for now because it seems less important -- it
doesn't really affect anything I've said about the core design, which
is what I'm focussing on for now.
It makes sense to use increment/decrement whenever feasible, even
though it isn't strictly necessary (or won't be, once the patch has
the required explicit probe support). The only reason to not apply
increment/decrement opclass skip support (that I can see) is because
it just isn't practical (this is generally the case for continuous
types). While it's slightly onerous to have to invent all this new
opclass infrastructure, it definitely makes sense.
There is a performance advantage to having skip arrays that can
increment through each distinct possible indexable value (this
increment/decrement stuff comes from the MDAM paper). The MDAM
techniques inherently work best when "skipping" columns of discrete
types like integer and date, which is why the paper has examples that
all look like that. If you look at my example query and its individual
accesses, you'll realize why this is so.
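As a rough illustration of what increment/decrement support involves for a discrete type, here is a minimal sketch in C. It does not reproduce the patch's SkipSupportData or its callback signatures; the functions sketch_int4_increment and sketch_int4_decrement are hypothetical, and plain int stands in for an int4 Datum.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Return the next value after "existing" in the type's domain.  When
 * there is no next value, set *overflow and return "existing" unchanged.
 */
int
sketch_int4_increment(int existing, bool *overflow)
{
    if (existing == INT_MAX)
    {
        *overflow = true;
        return existing;
    }
    *overflow = false;
    return existing + 1;
}

/* Mirror image of the above, as a backward scan would need */
int
sketch_int4_decrement(int existing, bool *underflow)
{
    if (existing == INT_MIN)
    {
        *underflow = true;
        return existing;
    }
    *underflow = false;
    return existing - 1;
}
```

For a discrete type the "next" value is cheap to compute and is never past the true next value in the index. A continuous type has no comparably useful successor, which is why the explicit probe fallback described above would be needed for those types.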
Thoughts?
[1]: https://vldb.org/conf/1995/P710.PDF
--
Peter Geoghegan
Attachments:
v1-0001-Add-skip-scan-to-nbtree.patch (application/octet-stream)
From 932bb757f5e8b5ee23c842438da93d39d8b4a1a7 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 16 Apr 2024 13:21:36 -0400
Subject: [PATCH v1] Add skip scan to nbtree.
Skip scan allows nbtree index scans to efficiently use a composite index
on (a, b) for queries with a predicate such as "WHERE b = 5".
This is useful in cases where the total number of distinct values in the
column 'a' is reasonably small (think hundreds, possibly thousands).
In effect, a skip scan treats the composite index on (a, b) as if it was
a series of disjunct subindexes -- one subindex per distinct 'a' value.
We exhaustively "search every subindex" using a qual that behaves like
"WHERE a = ANY(<every possible 'a' value>) AND b = 5".
---
src/include/access/nbtree.h | 16 +-
src/include/catalog/pg_amproc.dat | 16 +
src/include/catalog/pg_proc.dat | 24 +
src/include/utils/skipsupport.h | 140 ++
src/backend/access/nbtree/nbtcompare.c | 199 +++
src/backend/access/nbtree/nbtree.c | 5 +-
src/backend/access/nbtree/nbtutils.c | 1378 +++++++++++++++++--
src/backend/access/nbtree/nbtvalidate.c | 4 +
src/backend/commands/opclasscmds.c | 25 +
src/backend/utils/adt/Makefile | 1 +
src/backend/utils/adt/date.c | 34 +
src/backend/utils/adt/meson.build | 1 +
src/backend/utils/adt/selfuncs.c | 30 +-
src/backend/utils/adt/skipsupport.c | 54 +
src/backend/utils/adt/uuid.c | 65 +
src/backend/utils/misc/guc_tables.c | 12 +
doc/src/sgml/btree.sgml | 13 +
doc/src/sgml/xindex.sgml | 16 +-
src/test/regress/expected/alter_generic.out | 6 +-
src/test/regress/expected/psql.out | 3 +-
src/test/regress/sql/alter_generic.sql | 2 +-
src/tools/pgindent/typedefs.list | 3 +
22 files changed, 1882 insertions(+), 165 deletions(-)
create mode 100644 src/include/utils/skipsupport.h
create mode 100644 src/backend/utils/adt/skipsupport.c
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 749304334..81e99fcc1 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -24,6 +24,7 @@
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/shm_toc.h"
+#include "utils/skipsupport.h"
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
typedef uint16 BTCycleId;
@@ -709,7 +710,8 @@ BTreeTupleGetMaxHeapTID(IndexTuple itup)
#define BTINRANGE_PROC 3
#define BTEQUALIMAGE_PROC 4
#define BTOPTIONS_PROC 5
-#define BTNProcs 5
+#define BTSKIPSUPPORT_PROC 6
+#define BTNProcs 6
/*
* We need to be able to tell the difference between read and write
@@ -1032,9 +1034,15 @@ typedef BTScanPosData *BTScanPos;
typedef struct BTArrayKeyInfo
{
int scan_key; /* index of associated key in keyData */
+ int num_elems; /* number of elems (-1 for skip array) */
+
+ /* State used by standard arrays that store elements in memory */
int cur_elem; /* index of current element in elem_values */
- int num_elems; /* number of elems in current array value */
Datum *elem_values; /* array of num_elems Datums */
+
+ /* State used by skip arrays, which generate elements procedurally */
+ SkipSupportData sksup; /* opclass skip scan support */
+ bool null_elem; /* lowest/highest element actually NULL? */
} BTArrayKeyInfo;
typedef struct BTScanOpaqueData
@@ -1123,6 +1131,7 @@ typedef struct BTReadPageState
*/
#define SK_BT_REQFWD 0x00010000 /* required to continue forward scan */
#define SK_BT_REQBKWD 0x00020000 /* required to continue backward scan */
+#define SK_BT_SKIP 0x00040000 /* SK_SEARCHARRAY skip scan key */
#define SK_BT_INDOPTION_SHIFT 24 /* must clear the above bits */
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
@@ -1159,6 +1168,9 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+/* GUC parameter (just a temporary convenience for reviewers) */
+extern PGDLLIMPORT int skipscan_prefix_cols;
+
/*
* external entry points for btree, in nbtree.c
*/
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index f639c3a6a..2a8f6f3f1 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -21,6 +21,8 @@
amprocrighttype => 'bit', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
amprocrighttype => 'bool', amprocnum => '1', amproc => 'btboolcmp' },
+{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
+ amprocrighttype => 'bool', amprocnum => '6', amproc => 'btboolskipsupport' },
{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
amprocrighttype => 'bool', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/bpchar_ops', amproclefttype => 'bpchar',
@@ -41,12 +43,16 @@
amprocrighttype => 'char', amprocnum => '1', amproc => 'btcharcmp' },
{ amprocfamily => 'btree/char_ops', amproclefttype => 'char',
amprocrighttype => 'char', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/char_ops', amproclefttype => 'char',
+ amprocrighttype => 'char', amprocnum => '6', amproc => 'btcharskipsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '1', amproc => 'date_cmp' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '2', amproc => 'date_sortsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
+ amprocrighttype => 'date', amprocnum => '6', amproc => 'date_skipsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'timestamp', amprocnum => '1',
amproc => 'date_cmp_timestamp' },
@@ -122,6 +128,8 @@
amprocrighttype => 'int2', amprocnum => '2', amproc => 'btint2sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
+ amprocrighttype => 'int2', amprocnum => '6', amproc => 'btint2skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint24cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
@@ -141,6 +149,8 @@
amprocrighttype => 'int4', amprocnum => '2', amproc => 'btint4sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
+ amprocrighttype => 'int4', amprocnum => '6', amproc => 'btint4skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
amprocrighttype => 'int8', amprocnum => '1', amproc => 'btint48cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
@@ -160,6 +170,8 @@
amprocrighttype => 'int8', amprocnum => '2', amproc => 'btint8sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
amprocrighttype => 'int8', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
+ amprocrighttype => 'int8', amprocnum => '6', amproc => 'btint8skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint84cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
@@ -193,6 +205,8 @@
amprocrighttype => 'oid', amprocnum => '2', amproc => 'btoidsortsupport' },
{ amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
amprocrighttype => 'oid', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
+ amprocrighttype => 'oid', amprocnum => '6', amproc => 'btoidskipsupport' },
{ amprocfamily => 'btree/oidvector_ops', amproclefttype => 'oidvector',
amprocrighttype => 'oidvector', amprocnum => '1',
amproc => 'btoidvectorcmp' },
@@ -261,6 +275,8 @@
amprocrighttype => 'uuid', amprocnum => '2', amproc => 'uuid_sortsupport' },
{ amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
amprocrighttype => 'uuid', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
+ amprocrighttype => 'uuid', amprocnum => '6', amproc => 'uuid_skipsupport' },
{ amprocfamily => 'btree/record_ops', amproclefttype => 'record',
amprocrighttype => 'record', amprocnum => '1', amproc => 'btrecordcmp' },
{ amprocfamily => 'btree/record_image_ops', amproclefttype => 'record',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6a5476d3c..888a4893c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1004,18 +1004,27 @@
{ oid => '3129', descr => 'sort support',
proname => 'btint2sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint2sortsupport' },
+{ oid => '9290', descr => 'skip support',
+ proname => 'btint2skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint2skipsupport' },
{ oid => '351', descr => 'less-equal-greater',
proname => 'btint4cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'int4 int4', prosrc => 'btint4cmp' },
{ oid => '3130', descr => 'sort support',
proname => 'btint4sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint4sortsupport' },
+{ oid => '9291', descr => 'skip support',
+ proname => 'btint4skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint4skipsupport' },
{ oid => '842', descr => 'less-equal-greater',
proname => 'btint8cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'int8 int8', prosrc => 'btint8cmp' },
{ oid => '3131', descr => 'sort support',
proname => 'btint8sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint8sortsupport' },
+{ oid => '9292', descr => 'skip support',
+ proname => 'btint8skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint8skipsupport' },
{ oid => '354', descr => 'less-equal-greater',
proname => 'btfloat4cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'float4 float4', prosrc => 'btfloat4cmp' },
@@ -1034,12 +1043,18 @@
{ oid => '3134', descr => 'sort support',
proname => 'btoidsortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btoidsortsupport' },
+{ oid => '9293', descr => 'skip support',
+ proname => 'btoidskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btoidskipsupport' },
{ oid => '404', descr => 'less-equal-greater',
proname => 'btoidvectorcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'oidvector oidvector', prosrc => 'btoidvectorcmp' },
{ oid => '358', descr => 'less-equal-greater',
proname => 'btcharcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'char char', prosrc => 'btcharcmp' },
+{ oid => '9294', descr => 'skip support',
+ proname => 'btcharskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btcharskipsupport' },
{ oid => '359', descr => 'less-equal-greater',
proname => 'btnamecmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'name name', prosrc => 'btnamecmp' },
@@ -2214,6 +2229,9 @@
{ oid => '3136', descr => 'sort support',
proname => 'date_sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'date_sortsupport' },
+{ oid => '9295', descr => 'skip support',
+ proname => 'date_skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'date_skipsupport' },
{ oid => '4133', descr => 'window RANGE support',
proname => 'in_range', prorettype => 'bool',
proargtypes => 'date date interval bool bool',
@@ -4368,6 +4386,9 @@
{ oid => '1693', descr => 'less-equal-greater',
proname => 'btboolcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'bool bool', prosrc => 'btboolcmp' },
+{ oid => '9296', descr => 'skip support',
+ proname => 'btboolskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btboolskipsupport' },
{ oid => '1688', descr => 'hash',
proname => 'time_hash', prorettype => 'int4', proargtypes => 'time',
@@ -9175,6 +9196,9 @@
{ oid => '3300', descr => 'sort support',
proname => 'uuid_sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'uuid_sortsupport' },
+{ oid => '9297', descr => 'skip support',
+ proname => 'uuid_skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'uuid_skipsupport' },
{ oid => '2961', descr => 'I/O',
proname => 'uuid_recv', prorettype => 'uuid', proargtypes => 'internal',
prosrc => 'uuid_recv' },
diff --git a/src/include/utils/skipsupport.h b/src/include/utils/skipsupport.h
new file mode 100644
index 000000000..a71a624d0
--- /dev/null
+++ b/src/include/utils/skipsupport.h
@@ -0,0 +1,140 @@
+/*-------------------------------------------------------------------------
+ *
+ * skipsupport.h
+ * Support routines for B-Tree skip scans.
+ *
+ * B-Tree operator classes for discrete types can optionally provide a support
+ * function for skipping. This is used during skip scans.
+ *
+ * A B-tree operator class that implements skip support provides B-tree index
+ * scans with a way of enumerating and iterating through every possible value
+ * from the domain of indexable values. This gives scans a way to determine
+ * the next value in line for a given skip array/scan key/skipped attribute.
+ * This happens at the point where the scan determines that another primitive
+ * index scan is required. The next value is used (in combination with at
+ * least one additional lower-order non-skip key, taken from the SQL query) to
+ * relocate the scan, skipping over many irrelevant leaf pages in the process.
+ *
+ * There are many data types/opclasses where implementing a skip support
+ * scheme is inherently impossible (or at least impractical). Obviously, it
+ * would be wrong if the "next" value generated by an opclass was actually
+ * after the true next value (any index tuples with the true next value would
+ * be overlooked by the index scan). This partly explains why opclasses are
+ * under no obligation to implement skip support: a continuous type may have
+ * no way of generating a useful next value.
+ *
+ * Skip scan generally works best with discrete types such as integer, date,
+ * and boolean: types where we expect indexes to contain large groups of
+ * contiguous values (with respect to the leading/skipped index attribute).
+ * When gaps/discontinuities are naturally rare (e.g., a leading identity
+ * column in a composite index, a date column preceding a product_id column),
+ * then it makes sense for the skip scan to optimistically assume that the
+ * next distinct indexable value will find directly matching index tuples.
+ * The B-Tree code can fall back on explicit next-key probes for any opclass
+ * that doesn't include a skip support function, but it's best to provide skip
+ * support whenever possible. The B-Tree code assumes that it's always better
+ * to use the opclass skip support routine where available.
+ *
+ * When a skip scan "bets" that the next indexable value will find an exact
+ * match, there is significant upside, without any accompanying downside.
+ * When this optimistic strategy works out, the scan avoids the cost of an
+ * explicit probe (used in the no-skip-support case to determine the true next
+ * value in the index's skip attribute). When the strategy doesn't work out,
+ * then the scan is no worse off than it would have been without skip support.
+ * The explicit next-key probes used by B-Tree skip scan's fallback path are
+ * very similar to "failed" optimistic searches for the next indexable value
+ * (the next value according to the opclass skip support routine).
+ *
+ * (FIXME Actually, nbtree does no such thing right now, which is considered a
+ * blocker to commit.)
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/skipsupport.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SKIPSUPPORT_H
+#define SKIPSUPPORT_H
+
+#include "utils/relcache.h"
+
+typedef struct SkipSupportData *SkipSupport;
+
+/*
+ * State/callbacks used by skip arrays to procedurally generate elements.
+ *
+ * A BTSKIPSUPPORT_PROC function must set each and every field when called.
+ * If an opclass can only set some of the fields, then it cannot safely
+ * provide a skip support routine (and so must rely on the fallback strategy
+ * used by continuous types, such as numeric).
+ */
+typedef struct SkipSupportData
+{
+ /*
+ * low_elem and high_elem must be set with the lowest and highest possible
+ * values from the domain of indexable values (assuming standard ascending
+ * order). This helps the B-Tree code with finding its initial position
+ * at the leaf level (during the skip scan's first primitive index scan).
+ * In other words, it gives the B-Tree code a useful value to start from,
+ * before any data has been read from the index.
+ *
+ * low_elem and high_elem can also be used to prove that a qual is
+ * unsatisfiable in certain cross-type scenarios.
+ *
+ * low_elem and high_elem are also used by skip scans to determine when
+ * they've reached the final possible value (in the current direction).
+ * It's typical for the scan to run out of leaf pages before it runs out
+ * of unscanned indexable values, but it's still useful for the scan to
+ * have a way to recognize when it has reached the last possible value
+ * (this saves us a useless probe that just lands on the final leaf page).
+ *
+ * Note: the logic for determining that the scan has reached the final
+ * possible value naturally belongs in the B-Tree code. The final value
+ * isn't necessarily the original high_elem/low_elem set by the opclass.
+ * In particular, it'll be a lower/higher value when B-Tree preprocessing
+ * determines that the true range of possible values should be restricted,
+ * due to the presence of an inequality applied to the index's skipped
+ * attribute. These are range skip scans.
+ */
+ Datum low_elem; /* lowest sorting/leftmost non-NULL value */
+ Datum high_elem; /* highest sorting/rightmost non-NULL value */
+
+ /*
+ * Decrement/increment functions.
+ *
+ * Returns a decremented/incremented copy of caller's existing datum,
+ * allocated in caller's memory context (in the case of pass-by-reference
+ * types). It's not okay for these functions to leak any memory.
+ *
+ * Both decrement and increment callbacks are guaranteed to never be
+ * called with a NULL "existing" arg. (In general it is the B-Tree code's
+ * job to worry about NULLs, and about whether indexed values are stored
+ * in ASC order or DESC order.)
+ *
+ * The decrement callback is guaranteed to only be called with an
+ * "existing" value that's strictly > the low_elem set by the opclass.
+ * Similarly, the increment callback is guaranteed to only be called with
+ * an "existing" value that's strictly < the high_elem set by the opclass.
+ * Consequently, opclasses don't have to deal with "overflow" themselves
+ * (though asserting that the B-Tree code got it right is a good idea).
+ *
+ * It's quite possible (and very common) for the B-Tree skip scan caller's
+ * "existing" datum to just be a straight copy of a value that it copied
+ * from the index. Operator classes must be liberal in accepting every
+ * possible representational variation within the underlying data type.
+ * Opclasses don't have to preserve whatever semantically insignificant
+ * information the data type might be carrying around, though.
+ *
+ * Note: < and > are defined by the opclass's ORDER proc in the usual way.
+ */
+ Datum (*decrement) (Relation rel, Datum existing);
+ Datum (*increment) (Relation rel, Datum existing);
+} SkipSupportData;
+
+extern bool PrepareSkipSupportFromOpclass(Oid opfamily, Oid opcintype,
+ bool reverse, SkipSupport sksup);
+
+#endif /* SKIPSUPPORT_H */
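The callback contract described in skipsupport.h can be modeled outside of
Postgres. What follows is only a minimal standalone sketch, assuming an int16
domain: ToySkipSupport, toy_skipsupport and toy_advance are hypothetical names
invented for illustration, and none of this is code from the patch. It shows
the division of labor the header comments describe: the opclass callbacks
merely assert that they are never asked to step past low_elem/high_elem, while
the caller (standing in for the B-Tree code) does the bounds check and decides
when the array is exhausted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for SkipSupportData, specialized to int16 values */
typedef struct ToySkipSupport
{
	int16_t		low_elem;		/* lowest value in the domain */
	int16_t		high_elem;		/* highest value in the domain */
	int16_t		(*decrement) (int16_t existing);
	int16_t		(*increment) (int16_t existing);
} ToySkipSupport;

/* "Opclass" callbacks: overflow is the caller's problem, so only assert */
static int16_t
toy_decrement(int16_t existing)
{
	assert(existing > INT16_MIN);
	return (int16_t) (existing - 1);
}

static int16_t
toy_increment(int16_t existing)
{
	assert(existing < INT16_MAX);
	return (int16_t) (existing + 1);
}

/* Sets each and every field, as the header requires of a support routine */
static void
toy_skipsupport(ToySkipSupport *sksup)
{
	sksup->low_elem = INT16_MIN;
	sksup->high_elem = INT16_MAX;
	sksup->decrement = toy_decrement;
	sksup->increment = toy_increment;
}

/*
 * Caller-side guard (hypothetical): step *elem in the given direction.
 * Returns false, leaving *elem untouched, once the final possible value in
 * that direction has been reached.
 */
static bool
toy_advance(const ToySkipSupport *sksup, int16_t *elem, bool forward)
{
	if (forward)
	{
		if (*elem >= sksup->high_elem)
			return false;
		*elem = sksup->increment(*elem);
	}
	else
	{
		if (*elem <= sksup->low_elem)
			return false;
		*elem = sksup->decrement(*elem);
	}
	return true;
}
```

A range skip array would presumably use the same guard with tighter bounds
than the ones the opclass supplied, which is why the check lives in the caller
rather than in the callbacks.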
diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c
index 1c72867c8..c451c7b02 100644
--- a/src/backend/access/nbtree/nbtcompare.c
+++ b/src/backend/access/nbtree/nbtcompare.c
@@ -58,6 +58,7 @@
#include <limits.h>
#include "utils/fmgrprotos.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
#ifdef STRESS_SORT_INT_MIN
@@ -78,6 +79,39 @@ btboolcmp(PG_FUNCTION_ARGS)
PG_RETURN_INT32((int32) a - (int32) b);
}
+static Datum
+bool_decrement(Relation rel, Datum existing)
+{
+ bool bexisting = DatumGetBool(existing);
+
+ Assert(bexisting == true);
+
+ return BoolGetDatum(bexisting - 1);
+}
+
+static Datum
+bool_increment(Relation rel, Datum existing)
+{
+ bool bexisting = DatumGetBool(existing);
+
+ Assert(bexisting == false);
+
+ return BoolGetDatum(bexisting + 1);
+}
+
+Datum
+btboolskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = bool_decrement;
+ sksup->increment = bool_increment;
+ sksup->low_elem = BoolGetDatum(false);
+ sksup->high_elem = BoolGetDatum(true);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint2cmp(PG_FUNCTION_ARGS)
{
@@ -105,6 +139,39 @@ btint2sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int2_decrement(Relation rel, Datum existing)
+{
+ int16 iexisting = DatumGetInt16(existing);
+
+ Assert(iexisting > PG_INT16_MIN);
+
+ return Int16GetDatum(iexisting - 1);
+}
+
+static Datum
+int2_increment(Relation rel, Datum existing)
+{
+ int16 iexisting = DatumGetInt16(existing);
+
+ Assert(iexisting < PG_INT16_MAX);
+
+ return Int16GetDatum(iexisting + 1);
+}
+
+Datum
+btint2skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int2_decrement;
+ sksup->increment = int2_increment;
+ sksup->low_elem = Int16GetDatum(PG_INT16_MIN);
+ sksup->high_elem = Int16GetDatum(PG_INT16_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint4cmp(PG_FUNCTION_ARGS)
{
@@ -128,6 +195,39 @@ btint4sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int4_decrement(Relation rel, Datum existing)
+{
+ int32 iexisting = DatumGetInt32(existing);
+
+ Assert(iexisting > PG_INT32_MIN);
+
+ return Int32GetDatum(iexisting - 1);
+}
+
+static Datum
+int4_increment(Relation rel, Datum existing)
+{
+ int32 iexisting = DatumGetInt32(existing);
+
+ Assert(iexisting < PG_INT32_MAX);
+
+ return Int32GetDatum(iexisting + 1);
+}
+
+Datum
+btint4skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int4_decrement;
+ sksup->increment = int4_increment;
+ sksup->low_elem = Int32GetDatum(PG_INT32_MIN);
+ sksup->high_elem = Int32GetDatum(PG_INT32_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint8cmp(PG_FUNCTION_ARGS)
{
@@ -171,6 +271,39 @@ btint8sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int8_decrement(Relation rel, Datum existing)
+{
+ int64 iexisting = DatumGetInt64(existing);
+
+ Assert(iexisting > PG_INT64_MIN);
+
+ return Int64GetDatum(iexisting - 1);
+}
+
+static Datum
+int8_increment(Relation rel, Datum existing)
+{
+ int64 iexisting = DatumGetInt64(existing);
+
+ Assert(iexisting < PG_INT64_MAX);
+
+ return Int64GetDatum(iexisting + 1);
+}
+
+Datum
+btint8skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int8_decrement;
+ sksup->increment = int8_increment;
+ sksup->low_elem = Int64GetDatum(PG_INT64_MIN);
+ sksup->high_elem = Int64GetDatum(PG_INT64_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint48cmp(PG_FUNCTION_ARGS)
{
@@ -292,6 +425,39 @@ btoidsortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+oid_decrement(Relation rel, Datum existing)
+{
+ Oid oexisting = DatumGetObjectId(existing);
+
+ Assert(oexisting > InvalidOid);
+
+ return ObjectIdGetDatum(oexisting - 1);
+}
+
+static Datum
+oid_increment(Relation rel, Datum existing)
+{
+ Oid oexisting = DatumGetObjectId(existing);
+
+ Assert(oexisting < OID_MAX);
+
+ return ObjectIdGetDatum(oexisting + 1);
+}
+
+Datum
+btoidskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = oid_decrement;
+ sksup->increment = oid_increment;
+ sksup->low_elem = ObjectIdGetDatum(InvalidOid);
+ sksup->high_elem = ObjectIdGetDatum(OID_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btoidvectorcmp(PG_FUNCTION_ARGS)
{
@@ -325,3 +491,36 @@ btcharcmp(PG_FUNCTION_ARGS)
/* Be careful to compare chars as unsigned */
PG_RETURN_INT32((int32) ((uint8) a) - (int32) ((uint8) b));
}
+
+static Datum
+char_decrement(Relation rel, Datum existing)
+{
+ uint8 cexisting = DatumGetUInt8(existing); /* btcharcmp is unsigned */
+
+ Assert(cexisting > 0);
+
+ return CharGetDatum((uint8) cexisting - 1);
+}
+
+static Datum
+char_increment(Relation rel, Datum existing)
+{
+ uint8 cexisting = DatumGetUInt8(existing); /* btcharcmp is unsigned */
+
+ Assert(cexisting < UCHAR_MAX);
+
+ return CharGetDatum((uint8) cexisting + 1);
+}
+
+Datum
+btcharskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = char_decrement;
+ sksup->increment = char_increment;
+ sksup->low_elem = UInt8GetDatum(0);
+ sksup->high_elem = UInt8GetDatum(UCHAR_MAX);
+
+ PG_RETURN_VOID();
+}
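To make the role of these increment callbacks concrete, here is a toy model of
the "bet on the next value" strategy that the skipsupport.h header comments
describe. It is only a sketch under stated assumptions: the "index" is a plain
sorted array of (a, b) pairs, and ToyTuple, toy_lower_bound and toy_skip_scan
are hypothetical names, not functions from nbtree. Each call to
toy_lower_bound stands in for one primitive index scan that repositions the
scan for a qual like "WHERE b = 5".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical index tuple for a composite index on (a, b) */
typedef struct ToyTuple
{
	int			a;
	int			b;
} ToyTuple;

/* Offset of the first tuple >= (a, b) in idx[], which is sorted on (a, b) */
static size_t
toy_lower_bound(const ToyTuple *idx, size_t n, int a, int b)
{
	size_t		lo = 0,
				hi = n;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (idx[mid].a < a || (idx[mid].a == a && idx[mid].b < b))
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Count tuples with b == target_b, skipping through the values of 'a'.
 * a_low and a_high play the role of low_elem and high_elem.  *nprobes is
 * set to the number of repositioning searches that were needed.
 */
static size_t
toy_skip_scan(const ToyTuple *idx, size_t n, int target_b,
			  int a_low, int a_high, size_t *nprobes)
{
	size_t		matches = 0;
	int			a = a_low;

	*nprobes = 0;
	for (;;)
	{
		size_t		pos = toy_lower_bound(idx, n, a, target_b);

		(*nprobes)++;
		/* consume every match for the current (a, target_b) */
		while (pos < n && idx[pos].a == a && idx[pos].b == target_b)
		{
			matches++;
			pos++;
		}
		if (pos == n)
			break;				/* ran out of tuples */
		if (idx[pos].a > a)
			a = idx[pos].a;		/* landed in a later group: use its value */
		else if (a == a_high)
			break;				/* reached the final possible value */
		else
			a = a + 1;			/* optimistic "increment" of the skip array */
	}
	return matches;
}
```

With the eight tuples (1,3) (1,5) (1,7) (2,5) (2,5) (4,1) (4,9) (7,5) and
target_b = 5, this model finds the 4 matching tuples using 6 searches. When
the incremented value has no tuples (a = 5 here), the search simply lands on
the next group that does exist, which is the sense in which a failed bet
behaves like an explicit next-key probe.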
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 686a3206f..5cc520fa4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -324,10 +324,7 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
BTScanPosInvalidate(so->currPos);
BTScanPosInvalidate(so->markPos);
- if (scan->numberOfKeys > 0)
- so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
- else
- so->keyData = NULL;
+ so->keyData = NULL;
so->needPrimScan = false;
so->scanBehind = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index d6de2072d..295179392 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -28,10 +28,44 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/skipsupport.h"
+
+/*
+ * GUC parameter (temporary convenience for reviewers).
+ *
+ * To disable all skipping, set skipscan_prefix_cols=0. Otherwise set it to
+ * the highest attribute number for which we are still allowed to add a skip
+ * scan key.
+ *
+ * For example, setting skipscan_prefix_cols=1 before an index scan with qual
+ * "WHERE b = 1 AND c > 42" will make us generate a skip scan key on the
+ * column 'a' (which is attnum 1) only, preventing us from adding one for the
+ * column 'c' (and so 'c' will still have an inequality scan key, required in
+ * only one direction -- 'c' won't be output as a "range" skip key/array).
+ *
+ * The same scan keys will be output when skipscan_prefix_cols=2, given the
+ * same query/qual, since we naturally get a required equality scan key on 'b'
+ * from the input scan keys (provided we at least manage to add a skip scan
+ * key on 'a' that "anchors its required-ness" to the 'b' scan key).
+ *
+ * When skipscan_prefix_cols is set to the number of key columns in the index,
+ * we're as aggressive as possible about adding skip scan arrays/scan keys.
+ * This is the current default behavior, and the behavior we're targeting for
+ * the committed patch (if there are slowdowns from being maximally aggressive
+ * here then the likely solution is to make _bt_advance_array_keys adaptive,
+ * rather than trying to predict what will work during preprocessing).
+ */
+int skipscan_prefix_cols;
#define LOOK_AHEAD_REQUIRED_RECHECKS 3
#define LOOK_AHEAD_DEFAULT_DISTANCE 5
+typedef struct BTSkipPreproc
+{
+ SkipSupportData sksup; /* opclass skip scan support */
+ Oid eq_op; /* InvalidOid means don't skip */
+} BTSkipPreproc;
+
typedef struct BTSortArrayContext
{
FmgrInfo *sortproc;
@@ -62,18 +96,49 @@ static bool _bt_compare_array_scankey_args(IndexScanDesc scan,
ScanKey arraysk, ScanKey skey,
FmgrInfo *orderproc, BTArrayKeyInfo *array,
bool *qual_ok);
-static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan);
+static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys);
static void _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap);
+static int _bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts);
+static bool _bt_skip_support(Relation rel, int add_skip_attno,
+ BTSkipPreproc *skipatts);
+static inline Datum _bt_apply_decrement(Relation rel, ScanKey skey,
+ BTArrayKeyInfo *array);
+static inline Datum _bt_apply_increment(Relation rel, ScanKey skey,
+ BTArrayKeyInfo *array);
static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
static inline int32 _bt_compare_array_skey(FmgrInfo *orderproc,
Datum tupdatum, bool tupnull,
- Datum arrdatum, ScanKey cur);
+ Datum arrdatum, bool arrnull,
+ ScanKey cur);
+static void _bt_apply_compare_array(ScanKey arraysk, ScanKey skey,
+ FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok);
+static void _bt_apply_compare_skiparray(IndexScanDesc scan, ScanKey arraysk,
+ ScanKey skey, FmgrInfo *orderproc,
+ FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok);
static int _bt_binsrch_array_skey(FmgrInfo *orderproc,
bool cur_elem_trig, ScanDirection dir,
Datum tupdatum, bool tupnull,
BTArrayKeyInfo *array, ScanKey cur,
int32 *set_elem_result);
+static void _bt_binsrch_skiparray_skey(FmgrInfo *orderproc,
+ bool cur_elem_trig, ScanDirection dir,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *set_elem_result);
+static void _bt_scankey_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static void _bt_scankey_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static void _bt_scankey_set_low_or_high(Relation rel, ScanKey skey,
+ BTArrayKeyInfo *array, bool low_not_high);
+static void _bt_scankey_set_element(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ Datum tupdatum, bool tupnull);
+static void _bt_scankey_unset_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static void _bt_scankey_set_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_advance_skip_array_key_increment(Relation rel, ScanDirection dir,
+ BTArrayKeyInfo *array, ScanKey skey,
+ FmgrInfo *orderproc);
static void _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir);
static bool _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
IndexTuple tuple, TupleDesc tupdesc, int tupnatts,
@@ -251,9 +316,6 @@ _bt_freestack(BTStack stack)
* It is convenient for _bt_preprocess_keys caller to have to deal with no
* more than one equality strategy array scan key per index attribute. We'll
* always be able to set things up that way when complete opfamilies are used.
- * Eliminated array scan keys can be recognized as those that have had their
- * sk_strategy field set to InvalidStrategy here by us. Caller should avoid
- * including these in the scan's so->keyData[] output array.
*
* We set the scan key references from the scan's BTArrayKeyInfo info array to
* offsets into the temp modified input array returned to caller. Scans that
@@ -261,18 +323,36 @@ _bt_freestack(BTStack stack)
* preprocessing steps are complete. This will convert the scan key offset
* references into references to the scan's so->keyData[] output scan keys.
*
+ * We're also responsible for generating skip arrays (and their associated
+ * scan keys) here. This enables skip scan. We do this for index attributes
+ * that initially lacked an equality condition within scan->keyData[], iff
+ * doing so allows a later scan key (that was passed to us in scan->keyData[])
+ * to be marked required by later preprocessing on output.
+ * _bt_decide_skipatts decides which attributes receive skip arrays.
+ *
+ * Caller must pass *numberOfKeys to give us a way to change the number of
+ * input scan keys (our output is caller's input). The returned array can be
+ * smaller than scan->keyData[] when we eliminated a redundant array scan key
+ * (redundant with some other array scan key, for the same attribute). It can
+ * also be larger when we added a skip array/skip scan key. Caller uses
+ * *numberOfKeys to allocate so->keyData[] for the current btrescan.
+ *
* Note: the reason we need to return a temp scan key array, rather than just
* scribbling on scan->keyData, is that callers are permitted to call btrescan
* without supplying a new set of scankey data.
*/
static ScanKey
-_bt_preprocess_array_keys(IndexScanDesc scan)
+_bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
- int numberOfKeys = scan->numberOfKeys;
+ int numArrayKeyData = scan->numberOfKeys;
int16 *indoption = rel->rd_indoption;
- int numArrayKeys;
+ BTSkipPreproc skipatts[INDEX_MAX_KEYS];
+ int numArrayKeys,
+ numSkipArrayKeys,
+ output_ikey = 0;
+ AttrNumber attno_skip = 1;
int origarrayatt = InvalidAttrNumber,
origarraykey = -1;
Oid origelemtype = InvalidOid;
@@ -280,11 +360,14 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
MemoryContext oldContext;
ScanKey arrayKeyData; /* modified copy of scan->keyData */
- Assert(numberOfKeys);
+ Assert(scan->numberOfKeys);
- /* Quick check to see if there are any array keys */
+ /*
+ * Quick check to see if there are any array keys, or any missing keys we
+ * can generate a "skip scan" array key for ourselves
+ */
numArrayKeys = 0;
- for (int i = 0; i < numberOfKeys; i++)
+ for (int i = 0; i < scan->numberOfKeys; i++)
{
cur = &scan->keyData[i];
if (cur->sk_flags & SK_SEARCHARRAY)
@@ -300,6 +383,16 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
}
}
+ /* Consider generating skip arrays, and associated equality scan keys */
+ numSkipArrayKeys = _bt_decide_skipatts(scan, skipatts);
+ if (numSkipArrayKeys)
+ {
+ /* At least one skip array scan key must be added to arrayKeyData[] */
+ numArrayKeys += numSkipArrayKeys;
+ /* output scan key buffer allocation needs space for skip scan keys */
+ numArrayKeyData += numSkipArrayKeys;
+ }
+
/* Quit if nothing to do. */
if (numArrayKeys == 0)
return NULL;
@@ -317,19 +410,23 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
oldContext = MemoryContextSwitchTo(so->arrayContext);
- /* Create modifiable copy of scan->keyData in the workspace context */
- arrayKeyData = (ScanKey) palloc(numberOfKeys * sizeof(ScanKeyData));
- memcpy(arrayKeyData, scan->keyData, numberOfKeys * sizeof(ScanKeyData));
+ /* Create output scan keys in the workspace context */
+ arrayKeyData = (ScanKey) palloc(numArrayKeyData * sizeof(ScanKeyData));
/* Allocate space for per-array data in the workspace context */
so->arrayKeys = (BTArrayKeyInfo *) palloc(numArrayKeys * sizeof(BTArrayKeyInfo));
/* Allocate space for ORDER procs used to help _bt_checkkeys */
- so->orderProcs = (FmgrInfo *) palloc(numberOfKeys * sizeof(FmgrInfo));
+ so->orderProcs = (FmgrInfo *) palloc(numArrayKeyData * sizeof(FmgrInfo));
- /* Now process each array key */
+ /*
+ * Process each array key, and generate skip arrays as needed. Also copy
+ * every scan->keyData[] input scan key (whether it's an array or not)
+ * into the arrayKeyData array we'll return to our caller (barring any
+ * array scan keys that we could eliminate early through array merging).
+ */
numArrayKeys = 0;
- for (int i = 0; i < numberOfKeys; i++)
+ for (int input_ikey = 0; input_ikey < scan->numberOfKeys; input_ikey++)
{
FmgrInfo sortproc;
FmgrInfo *sortprocp = &sortproc;
@@ -345,14 +442,77 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
int num_nonnulls;
int j;
- cur = &arrayKeyData[i];
- if (!(cur->sk_flags & SK_SEARCHARRAY))
- continue;
+ /* Create a skip array and scan key where indicated by skipatts */
+ while (numSkipArrayKeys &&
+ attno_skip <= scan->keyData[input_ikey].sk_attno)
+ {
+ Oid opcintype = rel->rd_opcintype[attno_skip - 1];
+ Oid collation = rel->rd_indcollation[attno_skip - 1];
+ Oid eq_op = skipatts[attno_skip - 1].eq_op;
+ RegProcedure cmp_proc;
+
+ if (!OidIsValid(skipatts[attno_skip - 1].eq_op))
+ {
+ /* won't skip using this attribute */
+ attno_skip++;
+ continue;
+ }
+
+ cmp_proc = get_opcode(eq_op);
+ if (!RegProcedureIsValid(cmp_proc))
+ elog(ERROR, "missing oprcode for skipping equals operator %u", eq_op);
+
+ cur = &arrayKeyData[output_ikey];
+ ScanKeyEntryInitialize(cur,
+ SK_SEARCHARRAY | SK_BT_SKIP, /* flags */
+ attno_skip, /* skipped att number */
+ BTEqualStrategyNumber, /* equality strategy */
+ InvalidOid, /* opclass input subtype */
+ collation, /* index column's collation */
+ cmp_proc, /* equality operator's proc */
+ (Datum) 0); /* constant */
+
+ /* Initialize array fields */
+ so->arrayKeys[numArrayKeys].scan_key = output_ikey;
+ so->arrayKeys[numArrayKeys].num_elems = -1;
+ so->arrayKeys[numArrayKeys].cur_elem = 0;
+ so->arrayKeys[numArrayKeys].elem_values = NULL;
+ so->arrayKeys[numArrayKeys].null_elem = true; /* for now */
+ so->arrayKeys[numArrayKeys].sksup = skipatts[attno_skip - 1].sksup;
+
+ /*
+ * We'll need a 3-way ORDER proc to determine when and how the
+ * consed-up "array" will advance inside _bt_advance_array_keys.
+ * Set one up now.
+ */
+ _bt_setup_array_cmp(scan, cur, opcintype,
+ &so->orderProcs[output_ikey], NULL);
+
+ /*
+ * Prepare to output next scan key (might be another skip scan
+ * key, or it could be an input scan key from scan->keyData[])
+ */
+ numSkipArrayKeys--;
+ numArrayKeys++;
+ attno_skip++;
+ output_ikey++; /* keep this scan key/array */
+ }
/*
- * First, deconstruct the array into elements. Anything allocated
- * here (including a possibly detoasted array value) is in the
- * workspace context.
+ * Copy input scan key into temp arrayKeyData scan key array. (From
+ * here on, cur points at our copy of the input scan key.)
+ */
+ cur = &arrayKeyData[output_ikey];
+ *cur = scan->keyData[input_ikey];
+
+ if (!(cur->sk_flags & SK_SEARCHARRAY))
+ {
+ output_ikey++; /* keep this scan key */
+ continue;
+ }
+
+ /*
+ * Deconstruct the array into elements
*/
arrayval = DatumGetArrayTypeP(cur->sk_argument);
/* We could cache this data, but not clear it's worth it */
@@ -406,6 +566,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
_bt_find_extreme_element(scan, cur, elemtype,
BTGreaterStrategyNumber,
elem_values, num_nonnulls);
+ output_ikey++; /* keep this transformed scan key */
continue;
case BTEqualStrategyNumber:
/* proceed with rest of loop */
@@ -416,6 +577,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
_bt_find_extreme_element(scan, cur, elemtype,
BTLessStrategyNumber,
elem_values, num_nonnulls);
+ output_ikey++; /* keep this transformed scan key */
continue;
default:
elog(ERROR, "unrecognized StrategyNumber: %d",
@@ -432,7 +594,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
* sortproc just points to the same proc used during binary searches.
*/
_bt_setup_array_cmp(scan, cur, elemtype,
- &so->orderProcs[i], &sortprocp);
+ &so->orderProcs[output_ikey], &sortprocp);
/*
* Sort the non-null elements and eliminate any duplicates. We must
@@ -476,11 +638,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
break;
}
- /*
- * Indicate to _bt_preprocess_keys caller that it must ignore
- * this scan key
- */
- cur->sk_strategy = InvalidStrategy;
+ /* Throw away this array (by not incrementing output_ikey) */
continue;
}
@@ -511,12 +669,15 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
* Note: _bt_preprocess_array_keys_final will fix-up each array's
* scan_key field later on, after so->keyData[] has been finalized.
*/
- so->arrayKeys[numArrayKeys].scan_key = i;
+ so->arrayKeys[numArrayKeys].scan_key = output_ikey;
so->arrayKeys[numArrayKeys].num_elems = num_elems;
so->arrayKeys[numArrayKeys].elem_values = elem_values;
numArrayKeys++;
+ output_ikey++; /* keep this scan key/array */
}
+ /* Set final number of arrayKeyData[] keys, array keys */
+ *numberOfKeys = output_ikey;
so->numArrayKeys = numArrayKeys;
MemoryContextSwitchTo(oldContext);
@@ -624,7 +785,8 @@ _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap)
{
BTArrayKeyInfo *array = &so->arrayKeys[arrayidx];
- Assert(array->num_elems > 0);
+ Assert(array->num_elems > 0 || array->num_elems == -1);
+ Assert(array->num_elems != -1 || outkey->sk_flags & SK_BT_REQFWD);
if (array->scan_key == input_ikey)
{
@@ -685,6 +847,245 @@ _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap)
so->numArrayKeys, INDEX_MAX_KEYS)));
}
+/*
+ * _bt_decide_skipatts() -- set index attributes requiring skip arrays
+ *
+ * _bt_preprocess_array_keys helper function. Determines which attributes
+ * will require skip arrays/scan keys. Also sets up skip support function for
+ * each of these attributes.
+ *
+ * This sets up "skip scan". Adding skip arrays (and associated scan keys)
+ * allows _bt_preprocess_keys to mark lower-order scan keys (copied from the
+ * original scan->keyData[] array in the conventional way) as required. The
+ * overall effect is to enable skipping over irrelevant sections of the index.
+ *
+ * Return value is the total number of scan keys to add as "input" scan keys
+ * for further processing within _bt_preprocess_keys.
+ */
+static int
+_bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts)
+{
+ Relation rel = scan->indexRelation;
+ ScanKey inputsk;
+ AttrNumber attno_inputsk = 1,
+ attno_skip = 1;
+ bool attno_has_equal = false,
+ attno_has_rowcompare = false;
+ int numSkipArrayKeys = 0,
+ prev_numSkipArrayKeys = 0;
+
+ Assert(scan->numberOfKeys);
+
+ /*
+ * XXX Don't support system catalogs for now. Calls to routines like
+ * get_opfamily_member() are prone to infinite recursion, which we'll need
+ * to find a workaround for (hard-coded lookups?).
+ */
+ if (IsCatalogRelation(rel))
+ return 0;
+
+ /*
+ * FIXME Also don't support parallel scans for now. Must add logic to
+ * places like _bt_parallel_primscan_schedule so that we account for skip
+ * arrays when parallel workers serialize their array scan state.
+ */
+ if (scan->parallel_scan)
+ return 0;
+
+ inputsk = &scan->keyData[0];
+ for (int i = 0;; inputsk++, i++)
+ {
+ /*
+ * Backfill skip arrays for any wholly omitted attributes prior to
+ * attno_inputsk
+ */
+ while (attno_skip < attno_inputsk)
+ {
+ if (!_bt_skip_support(rel, attno_skip, &skipatts[attno_skip - 1]))
+ {
+ /*
+ * Opclass lacks a suitable skip support routine.
+ *
+ * Return prev_numSkipArrayKeys, so as to avoid including any
+ * "backfilled" arrays that were supposed to form a contiguous
+ * group with a skip array on this attribute. There is no
+ * benefit to adding backfill skip arrays unless we can do so
+ * for all attributes (all attributes up to and including the
+ * one immediately before attno_inputsk).
+ */
+ return prev_numSkipArrayKeys;
+ }
+
+ /* plan on adding a backfill skip array for this attribute */
+ numSkipArrayKeys++;
+ attno_skip++;
+ }
+
+ /*
+ * Stop once past the final input scan key. We deliberately never add
+ * a skip attribute for the attribute of the last input scan key.
+ *
+ * If the last input scan key(s) use equality strategy, then a skip
+ * attribute is superfluous at best. If the last input scan key uses
+ * an inequality strategy, then adding a skip scan array/scan key is a
+ * valid though suboptimal transformation. It is better to arrange
+ * for preprocessing to allow such an input inequality scan key to
+ * remain an inequality on output. That way _bt_checkkeys will be
+ * able to make best use of both of its precheck optimizations, but
+ * _bt_first will be no less capable of efficiently finding the
+ * starting position for each primitive index scan.
+ */
+ if (i >= scan->numberOfKeys)
+ break;
+
+ /*
+ * Cannot keep adding skip arrays after a RowCompare
+ */
+ if (attno_has_rowcompare)
+ break;
+
+ /*
+ * Apply temporary testing GUC that can be used to disable skipping
+ * (either in part or in whole)
+ */
+ if (attno_inputsk > skipscan_prefix_cols)
+ break;
+
+ /*
+ * Now consider next attno_inputsk (or keep going if this is an
+ * additional scan key against the same attribute)
+ */
+ if (attno_inputsk < inputsk->sk_attno)
+ {
+ prev_numSkipArrayKeys = numSkipArrayKeys;
+
+ /*
+ * Now add skip array for previous scan key's attribute, though
+ * only if the attribute has no equality strategy scan keys.
+ *
+ * Adding skip arrays to an attribute that has one or more
+ * inequality scan keys will cause preprocessing to output a range
+ * skip array. This will happen when preprocessing proper deals
+ * with the redundancy between the array and its inequalities.
+ */
+ skipatts[attno_skip - 1].eq_op = InvalidOid;
+ if (!attno_has_equal)
+ {
+ /* Only saw inequalities for the prior attribute */
+ if (_bt_skip_support(rel, attno_skip,
+ &skipatts[attno_skip - 1]))
+ {
+ /* add a range skip array for this attribute */
+ numSkipArrayKeys++;
+ }
+ else
+ break;
+ }
+ else
+ {
+ /*
+ * Saw an equality for the prior attribute, so it doesn't need
+ * a skip array (not even a range skip array). We'll be able
+ * to add later skip arrays, too (doesn't matter if the prior
+ * attribute uses an input opclass without skip support).
+ */
+ }
+
+ /* Set things up for this new attribute */
+ attno_skip++;
+ attno_inputsk = inputsk->sk_attno;
+ attno_has_equal = false;
+ }
+
+ /*
+ * Track if this scan key's attribute has any equality strategy scan
+ * keys.
+ *
+ * Treat IS NULL scan keys as using equal strategy (they'll be marked
+ * as using it later on, by _bt_fix_scankey_strategy).
+ */
+ if (inputsk->sk_strategy == BTEqualStrategyNumber ||
+ (inputsk->sk_flags & SK_SEARCHNULL))
+ attno_has_equal = true;
+
+ /*
+ * We don't support RowCompare transformation. Remember that we saw a
+ * RowCompare, so that we don't keep adding skip attributes.
+ *
+ * We do still backfill skip attributes before the RowCompare, so that
+ * it can be marked required. This is similar to what happens when a
+ * conventional inequality uses an opclass that lacks skip support.
+ */
+ if (inputsk->sk_flags & SK_ROW_HEADER)
+ attno_has_rowcompare = true;
+ }
+
+ return numSkipArrayKeys;
+}
+
+/*
+ * _bt_skip_support() -- set up skip support function in *skipatts
+ *
+ * Returns true on success, indicating that we set *skipatts with input
+ * opclass's skip support routine for caller. Otherwise returns false.
+ */
+static bool
+_bt_skip_support(Relation rel, int add_skip_attno, BTSkipPreproc *skipatts)
+{
+ int16 *indoption = rel->rd_indoption;
+ Oid opfamily = rel->rd_opfamily[add_skip_attno - 1];
+ Oid opcintype = rel->rd_opcintype[add_skip_attno - 1];
+ bool reverse;
+
+ /* Look up input opclass's equality operator */
+ skipatts->eq_op = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ /*
+ * We don't expect input opclasses lacking even an equality operator, but
+ * it's possible. Deal with it gracefully.
+ */
+ if (!OidIsValid(skipatts->eq_op))
+ return false;
+
+ /* Have skip support infrastructure set all SkipSupport fields */
+ reverse = (indoption[add_skip_attno - 1] & INDOPTION_DESC) != 0;
+ return PrepareSkipSupportFromOpclass(opfamily, opcintype, reverse,
+ &skipatts->sksup);
+}
+
+/*
+ * _bt_apply_decrement() -- Get a decremented copy of skey's arg
+ *
+ * Note: this wrapper function calls the opclass increment function when the
+ * index stores values in descending order. We're "logically decrementing" to
+ * the previous value in the key space regardless.
+ */
+static inline Datum
+_bt_apply_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ if (!(skey->sk_flags & SK_BT_DESC))
+ return array->sksup.decrement(rel, skey->sk_argument);
+ else
+ return array->sksup.increment(rel, skey->sk_argument);
+}
+
+/*
+ * _bt_apply_increment() -- Get an incremented copy of skey's arg
+ *
+ * Note: this wrapper function calls the opclass decrement function when the
+ * index stores values in descending order. We're "logically incrementing" to
+ * the next value in the key space regardless.
+ */
+static inline Datum
+_bt_apply_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ if (!(skey->sk_flags & SK_BT_DESC))
+ return array->sksup.increment(rel, skey->sk_argument);
+ else
+ return array->sksup.decrement(rel, skey->sk_argument);
+}
+
/*
* _bt_setup_array_cmp() -- Set up array comparison functions
*
@@ -979,15 +1380,10 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
{
Relation rel = scan->indexRelation;
Oid opcintype = rel->rd_opcintype[arraysk->sk_attno - 1];
- int cmpresult = 0,
- cmpexact = 0,
- matchelem,
- new_nelems = 0;
FmgrInfo crosstypeproc;
FmgrInfo *orderprocp = orderproc;
Assert(arraysk->sk_attno == skey->sk_attno);
- Assert(array->num_elems > 0);
Assert(!(arraysk->sk_flags & (SK_ISNULL | SK_ROW_HEADER | SK_ROW_MEMBER)));
Assert((arraysk->sk_flags & SK_SEARCHARRAY) &&
arraysk->sk_strategy == BTEqualStrategyNumber);
@@ -1000,8 +1396,8 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
* datum of opclass input type for the index's attribute (on-disk type).
* We can reuse the array's ORDER proc whenever the non-array scan key's
* type is a match for the corresponding attribute's input opclass type.
- * Otherwise, we have to do another ORDER proc lookup so that our call to
- * _bt_binsrch_array_skey applies the correct comparator.
+ * Otherwise, we have to do another ORDER proc lookup. We have to be sure
+ * that _bt_compare_array_skey/_bt_binsrch_array_skey use the right proc.
*
* Note: we have to support the convention that sk_subtype == InvalidOid
* means the opclass input type; this is a hack to simplify life for
@@ -1032,11 +1428,46 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
return false;
}
- /* We have all we need to determine redundancy/contradictoriness */
orderprocp = &crosstypeproc;
fmgr_info(cmp_proc, orderprocp);
}
+ /*
+ * We have all we need to determine redundancy/contradictoriness.
+ *
+ * Perform preprocessing of the array based on whether it's a conventional
+ * array, or a skip array. Sets *qual_ok correctly in passing.
+ */
+ if (array->num_elems != -1)
+ _bt_apply_compare_array(arraysk, skey,
+ orderprocp, array, qual_ok);
+ else
+ _bt_apply_compare_skiparray(scan, arraysk, skey,
+ orderproc, orderprocp,
+ array, qual_ok);
+
+ return true;
+}
+
+/*
+ * Finish off preprocessing of conventional (non-skip) array scan key when it
+ * is redundant with (or contradicted by) a non-array scalar scan key.
+ *
+ * _bt_compare_array_scankey_args helper function, called after the relevant
+ * (potentially cross-type) ORDER proc has been looked up successfully.
+ */
+static void
+_bt_apply_compare_array(ScanKey arraysk, ScanKey skey, FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok)
+{
+ int cmpresult = 0,
+ cmpexact = 0,
+ matchelem,
+ new_nelems = 0;
+
+ Assert(array->num_elems > 0);
+ Assert(!(arraysk->sk_flags & SK_BT_SKIP));
+
matchelem = _bt_binsrch_array_skey(orderprocp, false,
NoMovementScanDirection,
skey->sk_argument, false, array,
@@ -1088,8 +1519,152 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
array->num_elems = new_nelems;
*qual_ok = new_nelems > 0;
+}
- return true;
+/*
+ * Finish off preprocessing of skip array scan key when it is redundant with
+ * (or contradicted by) a non-array scalar scan key.
+ *
+ * _bt_compare_array_scankey_args helper function, called after the relevant
+ * (potentially cross-type) ORDER proc has been looked up successfully.
+ *
+ * Arrays used to skip (skip scan/missing key attribute predicates) work by
+ * procedurally generating their elements on the fly. We must still
+ * "eliminate contradictory elements", but it works a little differently: we
+ * narrow the range of the skip array, such that the array will never
+ * generate contradicted-by-skey elements.
+ *
+ * FIXME Our behavior in scenarios with cross-type operators (range skip scan
+ * cases) is buggy. We're naively copying datums of a different type from
+ * scalar inequality scan keys into the array's low_elem and high_elem
+ * fields. This tends to not visibly break in practice (types that appear
+ * within the same operator family tend to have compatible datum
+ * representations, at least on systems with little-endian byte order). Put
+ * off dealing with the problem until a later revision of the patch.
+ *
+ * It seems likely that the best way to fix this problem will involve keeping
+ * around the original operator in the BTArrayKeyInfo array struct whenever
+ * we're passed a "redundant" cross-type inequality operator (an approach
+ * involving casts/coercions might be tempting, but seems much too fragile).
+ * We only need to use not-column-input-opclass-type operators for the first
+ * and/or last array elements from the skip array under this scheme; we'll
+ * still mostly be dealing with opcintype-typed datums, copied from the index
+ * (as well as incrementing/decrementing copies of those index tuple datums).
+ * Importantly, this scheme should work just as well with an opfamily that
+ * doesn't even have an orderprocp cross-type ORDER operator to pass us here
+ * (we might even have to keep more than one same-strategy inequality, since
+ * in general _bt_preprocess_keys might not be able to prove which inequality
+ * is redundant).
+ */
+static void
+_bt_apply_compare_skiparray(IndexScanDesc scan, ScanKey arraysk, ScanKey skey,
+ FmgrInfo *orderproc, FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok)
+{
+ Relation rel = scan->indexRelation;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Form_pg_attribute attr = TupleDescAttr(RelationGetDescr(rel),
+ skey->sk_attno - 1);
+ MemoryContext oldContext;
+ int cmpresult;
+
+ /*
+ * We don't expect to have to deal with NULLs in non-array/non-skip scan
+ * key. We expect _bt_preprocess_array_keys to avoid generating a skip
+ * array for an index attribute with an IS NULL input scan key. It will
+ * still do so in the presence of IS NOT NULL input scan keys, but
+ * _bt_compare_scankey_args is expected to handle those for us.
+ */
+ Assert(arraysk->sk_flags & SK_BT_SKIP);
+ Assert(arraysk->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & SK_ISNULL));
+ Assert(array->num_elems == -1);
+
+ /*
+ * Scalar scan key must be a B-Tree operator, which must always be strict.
+ * Array shouldn't generate a NULL "array element"/an IS NULL qual. This
+ * isn't just an optimization; it's strictly necessary for correctness.
+ */
+ array->null_elem = false;
+
+ switch (skey->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+
+ /*
+ * detect if scan key argument will be < low_elem once
+ * decremented
+ */
+ cmpresult = _bt_compare_array_skey(orderprocp,
+ skey->sk_argument, false,
+ array->sksup.low_elem, false,
+ arraysk);
+ if (cmpresult <= 0)
+ {
+ /* decrementing would make qual unsatisfiable, so don't try */
+ *qual_ok = false;
+ return;
+ }
+
+ /* decremented scan key value becomes skip array's new high_elem */
+ oldContext = MemoryContextSwitchTo(so->arrayContext);
+ array->sksup.high_elem = _bt_apply_decrement(rel, skey, array);
+ MemoryContextSwitchTo(oldContext);
+ break;
+ case BTLessEqualStrategyNumber:
+ oldContext = MemoryContextSwitchTo(so->arrayContext);
+ array->sksup.high_elem = datumCopy(skey->sk_argument,
+ attr->attbyval, attr->attlen);
+ MemoryContextSwitchTo(oldContext);
+ break;
+ case BTEqualStrategyNumber:
+ /* _bt_preprocess_array_keys should have avoided this */
+ elog(ERROR, "equality strategy scan key conflicts with skip key for attribute %d on index \"%s\"",
+ skey->sk_attno, RelationGetRelationName(rel));
+ break;
+ case BTGreaterEqualStrategyNumber:
+ oldContext = MemoryContextSwitchTo(so->arrayContext);
+ array->sksup.low_elem = datumCopy(skey->sk_argument,
+ attr->attbyval, attr->attlen);
+ MemoryContextSwitchTo(oldContext);
+ break;
+ case BTGreaterStrategyNumber:
+
+ /*
+ * detect if scan key argument will be > high_elem once
+ * incremented
+ */
+ cmpresult = _bt_compare_array_skey(orderprocp,
+ skey->sk_argument, false,
+ array->sksup.high_elem, false,
+ arraysk);
+ if (cmpresult >= 0)
+ {
+ /* incrementing would make qual unsatisfiable, so don't try */
+ *qual_ok = false;
+ return;
+ }
+
+ /* incremented scan key value becomes skip array's new low_elem */
+ oldContext = MemoryContextSwitchTo(so->arrayContext);
+ array->sksup.low_elem = _bt_apply_increment(rel, skey, array);
+ MemoryContextSwitchTo(oldContext);
+ break;
+ default:
+ elog(ERROR, "unrecognized StrategyNumber: %d",
+ (int) skey->sk_strategy);
+ break;
+ }
+
+ /*
+ * Is the qual contradictory, or is it merely "redundant" with consed-up
+ * skip array?
+ */
+ cmpresult = _bt_compare_array_skey(orderproc, /* don't use orderprocp */
+ array->sksup.low_elem, false,
+ array->sksup.high_elem, false,
+ arraysk);
+ *qual_ok = (cmpresult <= 0);
}
/*
@@ -1130,7 +1705,8 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
static inline int32
_bt_compare_array_skey(FmgrInfo *orderproc,
Datum tupdatum, bool tupnull,
- Datum arrdatum, ScanKey cur)
+ Datum arrdatum, bool arrnull,
+ ScanKey cur)
{
int32 result = 0;
@@ -1138,14 +1714,14 @@ _bt_compare_array_skey(FmgrInfo *orderproc,
if (tupnull) /* NULL tupdatum */
{
- if (cur->sk_flags & SK_ISNULL)
+ if (arrnull)
result = 0; /* NULL "=" NULL */
else if (cur->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (cur->sk_flags & SK_ISNULL) /* NOT_NULL tupdatum, NULL arrdatum */
+ else if (arrnull) /* NOT_NULL tupdatum, NULL arrdatum */
{
if (cur->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
@@ -1211,6 +1787,8 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
Datum arrdatum;
Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(!(cur->sk_flags & SK_BT_SKIP));
+ Assert(!(cur->sk_flags & SK_ISNULL)); /* plain arrays can't do this */
Assert(cur->sk_strategy == BTEqualStrategyNumber);
if (cur_elem_trig)
@@ -1246,7 +1824,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
{
arrdatum = array->elem_values[low_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result <= 0)
{
@@ -1274,7 +1852,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
{
arrdatum = array->elem_values[high_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result >= 0)
{
@@ -1301,7 +1879,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
arrdatum = array->elem_values[mid_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result == 0)
{
@@ -1326,13 +1904,123 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
*/
if (low_elem != mid_elem)
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- array->elem_values[low_elem], cur);
+ array->elem_values[low_elem], false,
+ cur);
*set_elem_result = result;
return low_elem;
}
+/*
+ * _bt_binsrch_skiparray_skey() -- "Binary search" within a skip array
+ *
+ * Skip scan arrays procedurally generate their elements on-demand. They
+ * largely function in the same way as standard arrays. They can be rolled
+ * over by standard arrays (skip arrays can also roll over standard arrays).
+ *
+ * This routine doesn't return an index into the array, because the array
+ * doesn't actually have any elements (it has low_elem and high_elem, which
+ * indicate the range of values that the array can generate). Note that this
+ * may include a NULL value/an IS NULL qual (unlike with true arrays).
+ *
+ * Sets *set_elem_result just like _bt_binsrch_array_skey would with a true
+ * array. The value 0 indicates that tupdatum/tupnull is within the range of
+ * the skip array. Other values indicate what _bt_compare_array_skey returned
+ * for the best available match to tupdatum/tupnull (in practice this means
+ * either the lowest item or the highest item in the range of the array).
+ *
+ * cur_elem_trig indicates if array advancement was triggered by this skip
+ * array's scan key. We can apply this information to find the next matching
+ * array element in the current scan direction using fewer comparisons.
+ */
+static void
+_bt_binsrch_skiparray_skey(FmgrInfo *orderproc,
+ bool cur_elem_trig, ScanDirection dir,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *set_elem_result)
+{
+ Datum arrdatum;
+ bool arrnull;
+
+ Assert(!ScanDirectionIsNoMovement(dir));
+ Assert(cur->sk_flags & SK_BT_SKIP);
+ Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(cur->sk_flags & SK_BT_REQFWD);
+ Assert(array->num_elems == -1);
+
+ /*
+ * Compare tupdatum against "first array element" in the current scan
+ * direction first (and allow NULL to be treated as a possible element).
+ *
+ * Optimization: don't have to bother with this when passed a skip array
+ * that is known to have triggered array advancement.
+ */
+ if (!cur_elem_trig)
+ {
+ if (ScanDirectionIsForward(dir))
+ {
+ arrdatum = array->sksup.low_elem;
+ arrnull = array->null_elem && (cur->sk_flags & SK_BT_NULLS_FIRST);
+ }
+ else
+ {
+ arrdatum = array->sksup.high_elem;
+ arrnull = array->null_elem && !(cur->sk_flags & SK_BT_NULLS_FIRST);
+ }
+
+ *set_elem_result = _bt_compare_array_skey(orderproc,
+ tupdatum, tupnull,
+ arrdatum, arrnull,
+ cur);
+
+ /*
+ * Optimization: return early when >= lower bound happens to be an
+ * exact match (or when <= upper bound is an exact match during a
+ * backwards scan)
+ */
+ if (*set_elem_result == 0)
+ return;
+
+ /* Is tupdatum "before the start" of our lowest "element"? */
+ if ((ScanDirectionIsForward(dir) && *set_elem_result < 0) ||
+ (ScanDirectionIsBackward(dir) && *set_elem_result > 0))
+ return;
+ }
+
+ /*
+ * Now compare tupdatum to the "last array element" in the current scan
+ * direction (and allow NULL to be treated as a possible element)
+ */
+ if (ScanDirectionIsForward(dir))
+ {
+ arrdatum = array->sksup.high_elem;
+ arrnull = array->null_elem && !(cur->sk_flags & SK_BT_NULLS_FIRST);
+ }
+ else
+ {
+ arrdatum = array->sksup.low_elem;
+ arrnull = array->null_elem && (cur->sk_flags & SK_BT_NULLS_FIRST);
+ }
+
+ *set_elem_result = _bt_compare_array_skey(orderproc,
+ tupdatum, tupnull,
+ arrdatum, arrnull,
+ cur);
+
+ /* Is tupdatum "after the end" of our highest "element"? */
+ if ((ScanDirectionIsForward(dir) && *set_elem_result > 0) ||
+ (ScanDirectionIsBackward(dir) && *set_elem_result < 0))
+ return;
+
+ /*
+ * tupdatum must be within the range of the skip array. Have our caller
+ * treat tupdatum as one of the array's elements.
+ */
+ *set_elem_result = 0;
+}
+
/*
* _bt_start_array_keys() -- Initialize array keys at start of a scan
*
@@ -1342,29 +2030,256 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
void
_bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- int i;
Assert(so->numArrayKeys);
Assert(so->qual_ok);
- for (i = 0; i < so->numArrayKeys; i++)
+ for (int i = 0; i < so->numArrayKeys; i++)
{
BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
ScanKey skey = &so->keyData[curArrayKey->scan_key];
- Assert(curArrayKey->num_elems > 0);
Assert(skey->sk_flags & SK_SEARCHARRAY);
- if (ScanDirectionIsBackward(dir))
- curArrayKey->cur_elem = curArrayKey->num_elems - 1;
- else
- curArrayKey->cur_elem = 0;
- skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
+ _bt_scankey_set_low_or_high(rel, skey, curArrayKey,
+ ScanDirectionIsForward(dir));
}
so->scanBehind = false;
}
+/*
+ * _bt_scankey_decrement() -- decrement scan key's sk_argument
+ *
+ * Unsets scan key "IS NULL" flags when required, and handles memory
+ * management for pass-by-reference types.
+ */
+static void
+_bt_scankey_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+
+ if (skey->sk_flags & SK_ISNULL)
+ _bt_scankey_unset_isnull(rel, skey, array);
+ else
+ {
+ Datum dec_sk_argument;
+ Form_pg_attribute attr;
+
+ /* Get a decremented copy of existing sk_argument */
+ dec_sk_argument = _bt_apply_decrement(rel, skey, array);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Set decremented copy of original sk_argument in scan key */
+ skey->sk_argument = dec_sk_argument;
+ }
+}
+
+/*
+ * _bt_scankey_increment() -- increment scan key's sk_argument
+ *
+ * Unsets scan key "IS NULL" flags when required, and handles memory
+ * management for pass-by-reference types.
+ */
+static void
+_bt_scankey_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+
+ if (skey->sk_flags & SK_ISNULL)
+ _bt_scankey_unset_isnull(rel, skey, array);
+ else
+ {
+ Datum inc_sk_argument;
+ Form_pg_attribute attr;
+
+ /* Get an incremented copy of existing sk_argument */
+ inc_sk_argument = _bt_apply_increment(rel, skey, array);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Set incremented copy of original sk_argument in scan key */
+ skey->sk_argument = inc_sk_argument;
+ }
+}
+
+/*
+ * _bt_scankey_set_low_or_high() -- Set array scan key to lowest/highest element
+ *
+ * Caller also passes associated scan key, which will have its argument set to
+ * the lowest/highest array value in passing.
+ */
+static void
+_bt_scankey_set_low_or_high(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ bool low_not_high)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+
+ if (array->num_elems != -1)
+ {
+ /* set low or high element for conventional array */
+ int set_elem = 0;
+
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+
+ if (!low_not_high)
+ set_elem = array->num_elems - 1;
+
+ /*
+ * Just copy over array datum (only skip arrays require freeing and
+ * allocating memory for sk_argument)
+ */
+ array->cur_elem = set_elem;
+ skey->sk_argument = array->elem_values[set_elem];
+
+ return;
+ }
+
+ /* set low or high element for skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(array->num_elems == -1);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ if (array->null_elem &&
+ (low_not_high == ((skey->sk_flags & SK_BT_NULLS_FIRST) != 0)))
+ {
+ /* Set element to NULL (lowest/highest element) */
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+ }
+ else
+ {
+ /* Lowest/highest array element isn't NULL */
+ if (low_not_high)
+ skey->sk_argument = datumCopy(array->sksup.low_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_argument = datumCopy(array->sksup.high_elem,
+ attr->attbyval, attr->attlen);
+
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL);
+ }
+}
+
+/*
+ * _bt_scankey_set_element() -- Set skip array scan key's sk_argument
+ *
+ * Sets scan key to "IS NULL" when required, and handles memory management for
+ * pass-by-reference types.
+ */
+static void
+_bt_scankey_set_element(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ Datum tupdatum, bool tupnull)
+{
+ /* tupdatum within the range of low_elem/high_elem */
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /*
+ * Treat tupdatum/tupnull as a matching array element.
+ *
+ * We just copy tupdatum into the array's scan key (there is no
+ * conventional array element for us to set, of course).
+ */
+ if (tupnull)
+ {
+ /*
+ * Unlike standard arrays, skip arrays sometimes need to locate NULLs.
+ * We can treat them as just another value from the domain of indexed
+ * values.
+ */
+ Assert(array->null_elem);
+
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+ }
+ else
+ {
+ skey->sk_argument = datumCopy(tupdatum,
+ attr->attbyval, attr->attlen);
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL);
+ }
+}
+
+/*
+ * _bt_scankey_unset_isnull() -- increment/decrement scan key from NULL
+ *
+ * Unsets scan key's "IS NULL" marking, and sets the non-NULL value from the
+ * array immediately before (or immediately after) NULL in the key space.
+ */
+static void
+_bt_scankey_unset_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(skey->sk_flags & SK_ISNULL);
+ Assert(array->null_elem);
+
+ /*
+ * sk_argument must be set to whatever non-NULL value comes immediately
+ * before or after NULL
+ */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL);
+ if (skey->sk_flags & SK_BT_NULLS_FIRST)
+ skey->sk_argument = datumCopy(array->sksup.low_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_argument = datumCopy(array->sksup.high_elem,
+ attr->attbyval, attr->attlen);
+}
+
+/*
+ * _bt_scankey_set_isnull() -- increment/decrement scan key to NULL
+ *
+ * Sets scan key to "IS NULL", and handles memory management for
+ * pass-by-reference types.
+ */
+static void
+_bt_scankey_set_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & SK_ISNULL));
+ Assert(array->null_elem);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Set sk_argument to NULL */
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+}
+
/*
* _bt_advance_array_keys_increment() -- Advance to next set of array elements
*
@@ -1380,6 +2295,7 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
static bool
_bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
/*
@@ -1391,10 +2307,24 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
{
BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
ScanKey skey = &so->keyData[curArrayKey->scan_key];
+ FmgrInfo *orderproc = &so->orderProcs[curArrayKey->scan_key];
int cur_elem = curArrayKey->cur_elem;
int num_elems = curArrayKey->num_elems;
bool rolled = false;
+ /* Handle incrementing a skip array */
+ if (num_elems == -1)
+ {
+ /* Attempt to incrementally advance this skip scan array */
+ if (_bt_advance_skip_array_key_increment(rel, dir, curArrayKey,
+ skey, orderproc))
+ return true;
+
+ /* Array rolled over. Need to advance next array key, if any. */
+ continue;
+ }
+
+ /* Handle incrementing a true array */
if (ScanDirectionIsForward(dir) && ++cur_elem >= num_elems)
{
cur_elem = 0;
@@ -1411,7 +2341,7 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
if (!rolled)
return true;
- /* Need to advance next array key, if any */
+ /* Array rolled over. Need to advance next array key, if any. */
}
/*
@@ -1429,6 +2359,95 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
return false;
}
+/*
+ * _bt_advance_skip_array_key_increment() -- increment a skip scan array
+ *
+ * Returns true when the skip array was successfully incremented to the next
+ * value in the current scan direction, dir. Otherwise handles roll over by
+ * setting array to its first element for the current scan direction.
+ */
+static bool
+_bt_advance_skip_array_key_increment(Relation rel, ScanDirection dir,
+ BTArrayKeyInfo *array, ScanKey skey,
+ FmgrInfo *orderproc)
+{
+ Datum sk_argument = skey->sk_argument;
+ bool sk_isnull = (skey->sk_flags & SK_ISNULL) != 0;
+ int compare;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(array->num_elems == -1);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* high_elem is final non-NULL element in current scan direction */
+ compare = _bt_compare_array_skey(orderproc,
+ array->sksup.high_elem, false,
+ sk_argument, sk_isnull,
+ skey);
+ if (compare > 0)
+ {
+ /* Increment non-NULL element to next non-NULL element */
+ _bt_scankey_increment(rel, skey, array);
+
+ return true;
+ }
+ else if (compare == 0 && array->null_elem &&
+ !(skey->sk_flags & SK_BT_NULLS_FIRST))
+ {
+ /*
+ * Existing sk_argument was already equal to high_elem. Increment
+ * from high_elem to final NULL element (without calling opclass
+ * support function, which doesn't know how to handle NULLs).
+ */
+ _bt_scankey_set_isnull(rel, skey, array);
+
+ return true;
+ }
+
+ /* Exhausted all array elements in current scan direction */
+ }
+ else
+ {
+ /* low_elem is final non-NULL element in current scan direction */
+ compare = _bt_compare_array_skey(orderproc,
+ array->sksup.low_elem, false,
+ sk_argument, sk_isnull,
+ skey);
+ if (compare < 0)
+ {
+ /* Decrement non-NULL element to previous non-NULL element */
+ _bt_scankey_decrement(rel, skey, array);
+
+ return true;
+ }
+ else if (compare == 0 && array->null_elem &&
+ (skey->sk_flags & SK_BT_NULLS_FIRST))
+ {
+ /*
+ * Existing sk_argument was already equal to low_elem. Decrement
+ * from low_elem to final NULL element (without calling opclass
+ * support function, which doesn't know how to handle NULLs).
+ */
+ _bt_scankey_set_isnull(rel, skey, array);
+
+ return true;
+ }
+
+ /* Exhausted all array elements in current scan direction */
+ }
+
+ /*
+ * Skip array rolls over. Start over at the array's lowest sorting value
+ * (or its highest value, for backward scans).
+ */
+ _bt_scankey_set_low_or_high(rel, skey, array, ScanDirectionIsForward(dir));
+
+ /* Caller must consider earlier/more significant arrays in turn */
+ return false;
+}
+
/*
* _bt_rewind_nonrequired_arrays() -- Rewind non-required arrays
*
@@ -1485,6 +2504,8 @@ _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
if ((cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
continue;
+ Assert(array->num_elems > 0); /* No skipping of non-required arrays */
+
if (ScanDirectionIsForward(dir))
first_elem_dir = 0;
else
@@ -1558,6 +2579,8 @@ _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
for (int ikey = sktrig; ikey < so->numberOfKeys; ikey++)
{
ScanKey cur = so->keyData + ikey;
+ Datum sk_argument = cur->sk_argument;
+ bool sk_isnull = (cur->sk_flags & SK_ISNULL) != 0;
Datum tupdatum;
bool tupnull;
int32 result;
@@ -1621,7 +2644,8 @@ _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
result = _bt_compare_array_skey(&so->orderProcs[ikey],
tupdatum, tupnull,
- cur->sk_argument, cur);
+ sk_argument, sk_isnull,
+ cur);
/*
* Does this comparison indicate that caller must _not_ advance the
@@ -1954,18 +2978,9 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
if (beyond_end_advance)
{
- int final_elem_dir;
-
- if (ScanDirectionIsBackward(dir) || !array)
- final_elem_dir = 0;
- else
- final_elem_dir = array->num_elems - 1;
-
- if (array && array->cur_elem != final_elem_dir)
- {
- array->cur_elem = final_elem_dir;
- cur->sk_argument = array->elem_values[final_elem_dir];
- }
+ if (array)
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsBackward(dir));
continue;
}
@@ -1990,18 +3005,9 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
if (!all_required_satisfied || cur->sk_attno > tupnatts)
{
- int first_elem_dir;
-
- if (ScanDirectionIsForward(dir) || !array)
- first_elem_dir = 0;
- else
- first_elem_dir = array->num_elems - 1;
-
- if (array && array->cur_elem != first_elem_dir)
- {
- array->cur_elem = first_elem_dir;
- cur->sk_argument = array->elem_values[first_elem_dir];
- }
+ if (array)
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
continue;
}
@@ -2019,15 +3025,27 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
/*
* Binary search for closest match that's available from the array
*/
- set_elem = _bt_binsrch_array_skey(&so->orderProcs[ikey],
- cur_elem_trig, dir,
- tupdatum, tupnull, array, cur,
- &result);
+ if (array->num_elems != -1)
+ set_elem = _bt_binsrch_array_skey(&so->orderProcs[ikey],
+ cur_elem_trig, dir,
+ tupdatum, tupnull, array, cur,
+ &result);
- Assert(set_elem >= 0 && set_elem < array->num_elems);
+ /*
+ * Skip array. "Binary search" by checking if tupdatum/tupnull
+ * are within the low_elem/high_elem range of the skip array.
+ */
+ else
+ _bt_binsrch_skiparray_skey(&so->orderProcs[ikey],
+ cur_elem_trig, dir,
+ tupdatum, tupnull, array, cur,
+ &result);
}
else
{
+ Datum sk_argument = cur->sk_argument;
+ bool sk_isnull = (cur->sk_flags & SK_ISNULL) != 0;
+
Assert(sktrig_required && required);
/*
@@ -2041,7 +3059,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
result = _bt_compare_array_skey(&so->orderProcs[ikey],
tupdatum, tupnull,
- cur->sk_argument, cur);
+ sk_argument, sk_isnull, cur);
}
/*
@@ -2061,6 +3079,10 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
* its final element is. Once outside the loop we'll then "increment
* this array's set_elem" by calling _bt_advance_array_keys_increment.
* That way the process rolls over to higher order arrays as needed.
+ * The skip array case will set the array's scan key to the final
+ * valid element for the current scan direction, which is equivalent
+ * (when we have a real set_elem "match" it's just the final element
+ * in the current scan direction).
*
* Under this scheme any required arrays only ever ratchet forwards
* (or backwards), and always do so to the maximum possible extent
@@ -2100,11 +3122,62 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
}
}
- /* Advance array keys, even when set_elem isn't an exact match */
- if (array && array->cur_elem != set_elem)
+ /* Advance array keys, even when we don't have an exact match */
+
+ if (!array)
+ continue; /* no element to set in non-array */
+
+ /* Conventional arrays have a valid set_elem for us to advance to */
+ if (array->num_elems != -1)
{
- array->cur_elem = set_elem;
- cur->sk_argument = array->elem_values[set_elem];
+ if (array->cur_elem != set_elem)
+ {
+ array->cur_elem = set_elem;
+ cur->sk_argument = array->elem_values[set_elem];
+ }
+
+ continue;
+ }
+
+ /*
+ * Conceptually, skip arrays also have array elements. The actual
+ * elements/values are generated procedurally and on demand.
+ */
+ Assert(cur->sk_flags & SK_BT_SKIP);
+ Assert(array->num_elems == -1);
+ Assert(required);
+
+ if (result == 0)
+ {
+ /*
+ * Anything within the range of possible element values is treated
+ * as "a match for one of the array's elements". Store the next
+ * scan key argument value by taking a copy of the tupdatum value
+ * from caller's tuple (or set scan key IS NULL when tupnull, iff
+ * the array's range of possible elements covers NULL).
+ */
+ _bt_scankey_set_element(rel, cur, array, tupdatum, tupnull);
+ }
+ else if (beyond_end_advance)
+ {
+ /*
+ * We need to set the array element to the final "element" in the
+ * current scan direction for "beyond end of array element" array
+ * advancement. See above for an explanation.
+ */
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsBackward(dir));
+ }
+ else
+ {
+ /*
+ * The closest matching element is the lowest element; even that
+ * still puts us ahead of caller's tuple in the key space. This
+ * process has to carry to any lower-order arrays. See above for
+ * an explanation.
+ */
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
}
}
@@ -2550,9 +3623,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
int16 *indoption = scan->indexRelation->rd_indoption;
int new_numberOfKeys;
int numberOfEqualCols;
- ScanKey inkeys;
- ScanKey outkeys;
- ScanKey cur;
+ ScanKey inputsk;
BTScanKeyPreproc xform[BTMaxStrategyNumber];
bool test_result;
int i,
@@ -2584,7 +3655,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
return; /* done if qual-less scan */
/* If any keys are SK_SEARCHARRAY type, set up array-key info */
- arrayKeyData = _bt_preprocess_array_keys(scan);
+ arrayKeyData = _bt_preprocess_array_keys(scan, &numberOfKeys);
if (!so->qual_ok)
{
/* unmatchable array, so give up */
@@ -2598,32 +3669,38 @@ _bt_preprocess_keys(IndexScanDesc scan)
*/
if (arrayKeyData)
{
- inkeys = arrayKeyData;
+ inputsk = arrayKeyData;
/* Also maintain keyDataMap for remapping so->orderProc[] later */
keyDataMap = MemoryContextAlloc(so->arrayContext,
numberOfKeys * sizeof(int));
}
else
- inkeys = scan->keyData;
+ inputsk = scan->keyData;
+
+ /*
+ * Now that we have an estimate of the number of output scan keys
+ * (including any skip array scan keys), allocate space for them
+ */
+ if (so->keyData != NULL)
+ pfree(so->keyData);
+ so->keyData = palloc(sizeof(ScanKeyData) * numberOfKeys);
- outkeys = so->keyData;
- cur = &inkeys[0];
/* we check that input keys are correctly ordered */
- if (cur->sk_attno < 1)
+ if (inputsk->sk_attno < 1)
elog(ERROR, "btree index keys must be ordered by attribute");
/* We can short-circuit most of the work if there's just one key */
if (numberOfKeys == 1)
{
/* Apply indoption to scankey (might change sk_strategy!) */
- if (!_bt_fix_scankey_strategy(cur, indoption))
+ if (!_bt_fix_scankey_strategy(inputsk, indoption))
so->qual_ok = false;
- memcpy(outkeys, cur, sizeof(ScanKeyData));
+ memcpy(so->keyData, inputsk, sizeof(ScanKeyData));
so->numberOfKeys = 1;
/* We can mark the qual as required if it's for first index col */
- if (cur->sk_attno == 1)
- _bt_mark_scankey_required(outkeys);
+ if (inputsk->sk_attno == 1)
+ _bt_mark_scankey_required(so->keyData);
if (arrayKeyData)
{
/*
@@ -2631,8 +3708,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
* (we'll miss out on the single value array transformation, but
* that's not nearly as important when there's only one scan key)
*/
- Assert(cur->sk_flags & SK_SEARCHARRAY);
- Assert(cur->sk_strategy != BTEqualStrategyNumber ||
+ Assert(so->keyData[0].sk_flags & SK_SEARCHARRAY);
+ Assert(so->keyData[0].sk_strategy != BTEqualStrategyNumber ||
(so->arrayKeys[0].scan_key == 0 &&
OidIsValid(so->orderProcs[0].fn_oid)));
}
@@ -2660,12 +3737,12 @@ _bt_preprocess_keys(IndexScanDesc scan)
* handle after-last-key processing. Actual exit from the loop is at the
* "break" statement below.
*/
- for (i = 0;; cur++, i++)
+ for (i = 0;; inputsk++, i++)
{
if (i < numberOfKeys)
{
/* Apply indoption to scankey (might change sk_strategy!) */
- if (!_bt_fix_scankey_strategy(cur, indoption))
+ if (!_bt_fix_scankey_strategy(inputsk, indoption))
{
/* NULL can't be matched, so give up */
so->qual_ok = false;
@@ -2677,12 +3754,12 @@ _bt_preprocess_keys(IndexScanDesc scan)
* If we are at the end of the keys for a particular attr, finish up
* processing and emit the cleaned-up keys.
*/
- if (i == numberOfKeys || cur->sk_attno != attno)
+ if (i == numberOfKeys || inputsk->sk_attno != attno)
{
int priorNumberOfEqualCols = numberOfEqualCols;
/* check input keys are correctly ordered */
- if (i < numberOfKeys && cur->sk_attno < attno)
+ if (i < numberOfKeys && inputsk->sk_attno < attno)
elog(ERROR, "btree index keys must be ordered by attribute");
/*
@@ -2741,7 +3818,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
return;
}
/* else discard the redundant non-equality key */
- Assert(!array || array->num_elems > 0);
+ Assert(!array || array->num_elems > 0 ||
+ array->num_elems == -1);
xform[j].skey = NULL;
xform[j].ikey = -1;
}
@@ -2786,7 +3864,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
}
/*
- * Emit the cleaned-up keys into the outkeys[] array, and then
+ * Emit the cleaned-up keys into the so->keyData[] array, and then
* mark them if they are required. They are required (possibly
* only in one direction) if all attrs before this one had "=".
*/
@@ -2794,7 +3872,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
{
if (xform[j].skey)
{
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
memcpy(outkey, xform[j].skey, sizeof(ScanKeyData));
if (arrayKeyData)
@@ -2811,19 +3889,19 @@ _bt_preprocess_keys(IndexScanDesc scan)
break;
/* Re-initialize for new attno */
- attno = cur->sk_attno;
+ attno = inputsk->sk_attno;
memset(xform, 0, sizeof(xform));
}
/* check strategy this key's operator corresponds to */
- j = cur->sk_strategy - 1;
+ j = inputsk->sk_strategy - 1;
/* if row comparison, push it directly to the output array */
- if (cur->sk_flags & SK_ROW_HEADER)
+ if (inputsk->sk_flags & SK_ROW_HEADER)
{
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
- memcpy(outkey, cur, sizeof(ScanKeyData));
+ memcpy(outkey, inputsk, sizeof(ScanKeyData));
if (arrayKeyData)
keyDataMap[new_numberOfKeys - 1] = i;
if (numberOfEqualCols == attno - 1)
@@ -2837,19 +3915,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
continue;
}
- /*
- * Does this input scan key require further processing as an array?
- */
- if (cur->sk_strategy == InvalidStrategy)
- {
- /* _bt_preprocess_array_keys marked this array key redundant */
- Assert(arrayKeyData);
- Assert(cur->sk_flags & SK_SEARCHARRAY);
- continue;
- }
-
- if (cur->sk_strategy == BTEqualStrategyNumber &&
- (cur->sk_flags & SK_SEARCHARRAY))
+ if (inputsk->sk_strategy == BTEqualStrategyNumber &&
+ (inputsk->sk_flags & SK_SEARCHARRAY))
{
/* _bt_preprocess_array_keys kept this array key */
Assert(arrayKeyData);
@@ -2863,7 +3930,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (xform[j].skey == NULL)
{
/* nope, so this scan key wins by default (at least for now) */
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -2881,7 +3948,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
/*
* Have to set up array keys
*/
- if ((cur->sk_flags & SK_SEARCHARRAY))
+ if ((inputsk->sk_flags & SK_SEARCHARRAY))
{
array = &so->arrayKeys[arrayidx - 1];
orderproc = so->orderProcs + i;
@@ -2909,13 +3976,15 @@ _bt_preprocess_keys(IndexScanDesc scan)
*/
}
- if (_bt_compare_scankey_args(scan, cur, cur, xform[j].skey,
- array, orderproc, &test_result))
+ if (_bt_compare_scankey_args(scan, inputsk, inputsk,
+ xform[j].skey, array, orderproc,
+ &test_result))
{
/* Have all we need to determine redundancy */
if (test_result)
{
- Assert(!array || array->num_elems > 0);
+ Assert(!array || array->num_elems > 0 ||
+ array->num_elems == -1);
/*
* New key is more restrictive, and so replaces old key...
@@ -2923,7 +3992,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (j != (BTEqualStrategyNumber - 1) ||
!(xform[j].skey->sk_flags & SK_SEARCHARRAY))
{
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -2936,7 +4005,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
* scan key. _bt_compare_scankey_args expects us to
* always keep arrays (and discard non-arrays).
*/
- Assert(!(cur->sk_flags & SK_SEARCHARRAY));
+ Assert(!(inputsk->sk_flags & SK_SEARCHARRAY));
}
}
else if (j == (BTEqualStrategyNumber - 1))
@@ -2959,14 +4028,14 @@ _bt_preprocess_keys(IndexScanDesc scan)
* even with incomplete opfamilies. _bt_advance_array_keys
* depends on this.
*/
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
memcpy(outkey, xform[j].skey, sizeof(ScanKeyData));
if (arrayKeyData)
keyDataMap[new_numberOfKeys - 1] = xform[j].ikey;
if (numberOfEqualCols == attno - 1)
_bt_mark_scankey_required(outkey);
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -3057,10 +4126,11 @@ _bt_verify_keys_with_arraykeys(IndexScanDesc scan)
if (array->scan_key != ikey)
return false;
- if (array->num_elems <= 0)
+ if (array->num_elems == 0 || array->num_elems < -1)
return false;
- if (cur->sk_argument != array->elem_values[array->cur_elem])
+ if (array->num_elems != -1 &&
+ cur->sk_argument != array->elem_values[array->cur_elem])
return false;
if (last_sk_attno > cur->sk_attno)
return false;
@@ -3135,6 +4205,22 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool leftnull,
rightnull;
+ /* Handle skip array comparison with IS NOT NULL scan key */
+ if ((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP)
+ {
+ /* Shouldn't generate skip array in presence of IS NULL key */
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_SEARCHNULL));
+ Assert((leftarg->sk_flags | rightarg->sk_flags) & SK_SEARCHNOTNULL);
+
+ /* Don't allow skip array to generate IS NULL scan key/element */
+ Assert(array->num_elems == -1);
+ array->null_elem = false;
+
+ /* IS NOT NULL key (could be leftarg or rightarg) now redundant */
+ *result = true;
+ return true;
+ }
+
if (leftarg->sk_flags & SK_ISNULL)
{
Assert(leftarg->sk_flags & (SK_SEARCHNULL | SK_SEARCHNOTNULL));
@@ -3208,6 +4294,7 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
{
/* Can't make the comparison */
*result = false; /* suppress compiler warnings */
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP));
return false;
}
@@ -3380,13 +4467,6 @@ _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption)
return true;
}
- if (skey->sk_strategy == InvalidStrategy)
- {
- /* Already-eliminated array scan key; don't need to fix anything */
- Assert(skey->sk_flags & SK_SEARCHARRAY);
- return true;
- }
-
/* Adjust strategy for DESC, if we didn't already */
if ((addflags & SK_BT_DESC) && !(skey->sk_flags & SK_BT_DESC))
skey->sk_strategy = BTCommuteStrategyNumber(skey->sk_strategy);
diff --git a/src/backend/access/nbtree/nbtvalidate.c b/src/backend/access/nbtree/nbtvalidate.c
index e9d4cd60d..96d0d9185 100644
--- a/src/backend/access/nbtree/nbtvalidate.c
+++ b/src/backend/access/nbtree/nbtvalidate.c
@@ -114,6 +114,10 @@ btvalidate(Oid opclassoid)
case BTOPTIONS_PROC:
ok = check_amoptsproc_signature(procform->amproc);
break;
+ case BTSKIPSUPPORT_PROC:
+ ok = check_amproc_signature(procform->amproc, VOIDOID, true,
+ 1, 1, INTERNALOID);
+ break;
default:
ereport(INFO,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
diff --git a/src/backend/commands/opclasscmds.c b/src/backend/commands/opclasscmds.c
index b8b5c147c..a86dbf71b 100644
--- a/src/backend/commands/opclasscmds.c
+++ b/src/backend/commands/opclasscmds.c
@@ -1330,6 +1330,31 @@ assignProcTypes(OpFamilyMember *member, Oid amoid, Oid typeoid,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
errmsg("btree equal image functions must not be cross-type")));
}
+ else if (member->number == BTSKIPSUPPORT_PROC)
+ {
+ if (procform->pronargs != 1 ||
+ procform->proargtypes.values[0] != INTERNALOID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must accept type \"internal\"")));
+ if (procform->prorettype != VOIDOID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must return void")));
+
+ /*
+ * pg_amproc functions are indexed by (lefttype, righttype), but a
+ * skip support function doesn't make sense in cross-type
+ * scenarios. The same opclass opcintype OID is always used for
+ * lefttype and righttype. Providing a cross-type routine isn't
+ * sensible. Reject cross-type ALTER OPERATOR FAMILY ... ADD
+ * FUNCTION 6 statements here.
+ */
+ if (member->lefttype != member->righttype)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must not be cross-type")));
+ }
}
else if (amoid == HASH_AM_OID)
{
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 610ccf2f7..49d792c1c 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -96,6 +96,7 @@ OBJS = \
rowtypes.o \
ruleutils.o \
selfuncs.o \
+ skipsupport.o \
tid.o \
timestamp.o \
trigfuncs.o \
diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c
index 9c854e0e5..ea3d0f4b5 100644
--- a/src/backend/utils/adt/date.c
+++ b/src/backend/utils/adt/date.c
@@ -34,6 +34,7 @@
#include "utils/date.h"
#include "utils/datetime.h"
#include "utils/numeric.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
/*
@@ -455,6 +456,39 @@ date_sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+date_decrement(Relation rel, Datum existing)
+{
+ DateADT dexisting = DatumGetDateADT(existing);
+
+ Assert(dexisting > DATEVAL_NOBEGIN);
+
+ return DateADTGetDatum(dexisting - 1);
+}
+
+static Datum
+date_increment(Relation rel, Datum existing)
+{
+ DateADT dexisting = DatumGetDateADT(existing);
+
+ Assert(dexisting < DATEVAL_NOEND);
+
+ return DateADTGetDatum(dexisting + 1);
+}
+
+Datum
+date_skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = date_decrement;
+ sksup->increment = date_increment;
+ sksup->low_elem = DateADTGetDatum(DATEVAL_NOBEGIN);
+ sksup->high_elem = DateADTGetDatum(DATEVAL_NOEND);
+
+ PG_RETURN_VOID();
+}
+
Datum
date_finite(PG_FUNCTION_ARGS)
{
diff --git a/src/backend/utils/adt/meson.build b/src/backend/utils/adt/meson.build
index 48dbcf59a..4f82f3169 100644
--- a/src/backend/utils/adt/meson.build
+++ b/src/backend/utils/adt/meson.build
@@ -83,6 +83,7 @@ backend_sources += files(
'rowtypes.c',
'ruleutils.c',
'selfuncs.c',
+ 'skipsupport.c',
'tid.c',
'timestamp.c',
'trigfuncs.c',
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5f5d7959d..c1df7be9f 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6800,6 +6800,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
List *indexBoundQuals;
int indexcol;
bool eqQualHere;
+ bool found_skip;
bool found_saop;
bool found_is_null_op;
double num_sa_scans;
@@ -6825,6 +6826,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
indexBoundQuals = NIL;
indexcol = 0;
eqQualHere = false;
+ found_skip = false;
found_saop = false;
found_is_null_op = false;
num_sa_scans = 1;
@@ -6833,15 +6835,38 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
IndexClause *iclause = lfirst_node(IndexClause, lc);
ListCell *lc2;
+ /*
+ * XXX For now we just cost skip scans via generic rules: make a
+ * uniform assumption that there will be 10 primitive index scans per
+ * skipped attribute, relying on the "1/3 of all index pages" cap that
+ * this costing has used since Postgres 17. Also assume that skipping
+ * won't take place for an index that has fewer than 100 pages.
+ *
+ * The current approach to costing leaves much to be desired, but is
+ * at least better than nothing at all (keeping the code as it is on
+ * HEAD just makes testing and review inconvenient).
+ */
if (indexcol != iclause->indexcol)
{
/* Beginning of a new column's quals */
if (!eqQualHere)
- break; /* done if no '=' qual for indexcol */
+ {
+ found_skip = true; /* skip when no '=' qual for indexcol */
+ if (index->pages < 100)
+ break;
+ num_sa_scans += 10;
+ }
eqQualHere = false;
indexcol++;
if (indexcol != iclause->indexcol)
- break; /* no quals at all for indexcol */
+ {
+ /* no quals at all for indexcol */
+ found_skip = true;
+ if (index->pages < 100)
+ break;
+ num_sa_scans += 10 * (indexcol - iclause->indexcol);
+ continue;
+ }
}
/* Examine each indexqual associated with this index clause */
@@ -6914,6 +6939,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
if (index->unique &&
indexcol == index->nkeycolumns - 1 &&
eqQualHere &&
+ !found_skip &&
!found_saop &&
!found_is_null_op)
numIndexTuples = 1.0;
diff --git a/src/backend/utils/adt/skipsupport.c b/src/backend/utils/adt/skipsupport.c
new file mode 100644
index 000000000..9665e4985
--- /dev/null
+++ b/src/backend/utils/adt/skipsupport.c
@@ -0,0 +1,54 @@
+/*-------------------------------------------------------------------------
+ *
+ * skipsupport.c
+ * Support routines for B-Tree skip scans.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/adt/skipsupport.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <limits.h>
+
+#include "access/nbtree.h"
+#include "utils/lsyscache.h"
+#include "utils/skipsupport.h"
+
+/*
+ * Fill in SkipSupport given an operator class (opfamily + opcintype).
+ *
+ * On success, returns true, and initializes all SkipSupport fields for
+ * caller. Otherwise returns false, indicating that operator class has no
+ * skip support function.
+ */
+bool
+PrepareSkipSupportFromOpclass(Oid opfamily, Oid opcintype, bool reverse,
+ SkipSupport sksup)
+{
+ Oid skipSupportFunction;
+
+ /* Look for a skip support function */
+ skipSupportFunction = get_opfamily_proc(opfamily, opcintype, opcintype,
+ BTSKIPSUPPORT_PROC);
+ if (!OidIsValid(skipSupportFunction))
+ return false;
+
+ OidFunctionCall1(skipSupportFunction, PointerGetDatum(sksup));
+
+ if (reverse)
+ {
+ Datum low_elem = sksup->low_elem;
+
+ sksup->low_elem = sksup->high_elem;
+ sksup->high_elem = low_elem;
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/adt/uuid.c b/src/backend/utils/adt/uuid.c
index 45eb1b2fe..a9222f896 100644
--- a/src/backend/utils/adt/uuid.c
+++ b/src/backend/utils/adt/uuid.c
@@ -13,12 +13,15 @@
#include "postgres.h"
+#include <limits.h>
+
#include "common/hashfn.h"
#include "lib/hyperloglog.h"
#include "libpq/pqformat.h"
#include "port/pg_bswap.h"
#include "utils/fmgrprotos.h"
#include "utils/guc.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
#include "utils/timestamp.h"
#include "utils/uuid.h"
@@ -390,6 +393,68 @@ uuid_abbrev_convert(Datum original, SortSupport ssup)
return res;
}
+static Datum
+uuid_decrement(Relation rel, Datum existing)
+{
+ pg_uuid_t *uuid;
+
+ uuid = (pg_uuid_t *) palloc(UUID_LEN);
+ memcpy(uuid, DatumGetUUIDP(existing), UUID_LEN);
+ for (int i = UUID_LEN - 1; i >= 0; i--)
+ {
+ if (uuid->data[i] > 0)
+ {
+ uuid->data[i]--;
+ return UUIDPGetDatum(uuid);
+ }
+ uuid->data[i] = UCHAR_MAX;
+ }
+
+ Assert(false);
+
+ return UUIDPGetDatum(uuid);
+}
+
+static Datum
+uuid_increment(Relation rel, Datum existing)
+{
+ pg_uuid_t *uuid;
+
+ uuid = (pg_uuid_t *) palloc(UUID_LEN);
+ memcpy(uuid, DatumGetUUIDP(existing), UUID_LEN);
+ for (int i = UUID_LEN - 1; i >= 0; i--)
+ {
+ if (uuid->data[i] < UCHAR_MAX)
+ {
+ uuid->data[i]++;
+ return UUIDPGetDatum(uuid);
+ }
+ uuid->data[i] = 0;
+ }
+
+ Assert(false);
+
+ return UUIDPGetDatum(uuid);
+}
+
+Datum
+uuid_skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+ pg_uuid_t *uuid_min = palloc(UUID_LEN);
+ pg_uuid_t *uuid_max = palloc(UUID_LEN);
+
+ memset(uuid_min->data, 0x00, UUID_LEN);
+ memset(uuid_max->data, 0xFF, UUID_LEN);
+
+ sksup->decrement = uuid_decrement;
+ sksup->increment = uuid_increment;
+ sksup->low_elem = UUIDPGetDatum(uuid_min);
+ sksup->high_elem = UUIDPGetDatum(uuid_max);
+
+ PG_RETURN_VOID();
+}
+
/* hash index support */
Datum
uuid_hash(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 46c258be2..b84ec2298 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -28,6 +28,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/nbtree.h"
#include "access/slru.h"
#include "access/toast_compression.h"
#include "access/twophase.h"
@@ -3523,6 +3524,17 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ /* XXX Remove before commit */
+ {
+ {"skipscan_prefix_cols", PGC_SUSET, DEVELOPER_OPTIONS,
+ NULL, NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &skipscan_prefix_cols,
+ INDEX_MAX_KEYS, 0, INDEX_MAX_KEYS,
+ NULL, NULL, NULL
+ },
+
{
/* Can't be set in postgresql.conf */
{"server_version_num", PGC_INTERNAL, PRESET_OPTIONS,
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 2b3997988..9662fb2ba 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -583,6 +583,19 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
</para>
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><function>skipsupport</function></term>
+ <listitem>
+ <para>
+ Optionally, a btree operator family may provide a <firstterm>skip
+ support</firstterm> function, registered under support function
+ number 6. These functions allow the B-tree code to more efficiently
+ navigate the index structure via an index <quote>skip scan</quote>. The
+ APIs involved in this are defined in
+ <filename>src/include/utils/skipsupport.h</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</sect2>
diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index 22d8ad1aa..f17dd3456 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -461,6 +461,13 @@
</entry>
<entry>5</entry>
</row>
+ <row>
+ <entry>
+ Return the addresses of C-callable skip support function(s)
+ (optional)
+ </entry>
+ <entry>6</entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -1056,7 +1063,8 @@ DEFAULT FOR TYPE int8 USING btree FAMILY integer_ops AS
FUNCTION 1 btint8cmp(int8, int8) ,
FUNCTION 2 btint8sortsupport(internal) ,
FUNCTION 3 in_range(int8, int8, int8, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint8skipsupport(internal);
CREATE OPERATOR CLASS int4_ops
DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
@@ -1069,7 +1077,8 @@ DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
FUNCTION 1 btint4cmp(int4, int4) ,
FUNCTION 2 btint4sortsupport(internal) ,
FUNCTION 3 in_range(int4, int4, int4, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint4skipsupport(internal);
CREATE OPERATOR CLASS int2_ops
DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
@@ -1082,7 +1091,8 @@ DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
FUNCTION 1 btint2cmp(int2, int2) ,
FUNCTION 2 btint2sortsupport(internal) ,
FUNCTION 3 in_range(int2, int2, int2, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint2skipsupport(internal);
ALTER OPERATOR FAMILY integer_ops USING btree ADD
-- cross-type comparisons int8 vs int2
diff --git a/src/test/regress/expected/alter_generic.out b/src/test/regress/expected/alter_generic.out
index ae54cb254..8b6b775c1 100644
--- a/src/test/regress/expected/alter_generic.out
+++ b/src/test/regress/expected/alter_generic.out
@@ -362,9 +362,9 @@ ERROR: invalid operator number 0, must be between 1 and 5
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 1 < ; -- operator without argument types
ERROR: operator argument types must be specified in ALTER OPERATOR FAMILY
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 0 btint42cmp(int4, int2); -- invalid options parsing function
-ERROR: invalid function number 0, must be between 1 and 5
-ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 6 btint42cmp(int4, int2); -- function number should be between 1 and 5
-ERROR: invalid function number 6, must be between 1 and 5
+ERROR: invalid function number 0, must be between 1 and 6
+ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 7 btint42cmp(int4, int2); -- function number should be between 1 and 6
+ERROR: invalid function number 7, must be between 1 and 6
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD STORAGE invalid_storage; -- Ensure STORAGE is not a part of ALTER OPERATOR FAMILY
ERROR: STORAGE cannot be specified in ALTER OPERATOR FAMILY
DROP OPERATOR FAMILY alt_opf4 USING btree;
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 3bbe4c5f9..a8d5be6c1 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5138,9 +5138,10 @@ List of access methods
btree | uuid_ops | uuid | uuid | 1 | uuid_cmp
btree | uuid_ops | uuid | uuid | 2 | uuid_sortsupport
btree | uuid_ops | uuid | uuid | 4 | btequalimage
+ btree | uuid_ops | uuid | uuid | 6 | uuid_skipsupport
hash | uuid_ops | uuid | uuid | 1 | uuid_hash
hash | uuid_ops | uuid | uuid | 2 | uuid_hash_extended
-(5 rows)
+(6 rows)
-- check \dconfig
set work_mem = 10240;
diff --git a/src/test/regress/sql/alter_generic.sql b/src/test/regress/sql/alter_generic.sql
index de58d268d..4246afefd 100644
--- a/src/test/regress/sql/alter_generic.sql
+++ b/src/test/regress/sql/alter_generic.sql
@@ -310,7 +310,7 @@ ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 6 < (int4, int2); -- ope
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 0 < (int4, int2); -- operator number should be between 1 and 5
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 1 < ; -- operator without argument types
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 0 btint42cmp(int4, int2); -- invalid options parsing function
-ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 6 btint42cmp(int4, int2); -- function number should be between 1 and 5
+ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 7 btint42cmp(int4, int2); -- function number should be between 1 and 6
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD STORAGE invalid_storage; -- Ensure STORAGE is not a part of ALTER OPERATOR FAMILY
DROP OPERATOR FAMILY alt_opf4 USING btree;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 61ad417cd..2aa7c871c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -218,6 +218,7 @@ BTScanPos
BTScanPosData
BTScanPosItem
BTShared
+BTSkipPreproc
BTSortArrayContext
BTSpool
BTStack
@@ -2650,6 +2651,8 @@ SingleBoundSortItem
SinglePartitionSpec
Size
SkipPages
+SkipSupport
+SkipSupportData
SlabBlock
SlabContext
SlabSlot
--
2.45.1
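The carry logic in the patch's uuid_increment and uuid_decrement routines can be tried outside the server. What follows is a minimal standalone sketch, not code from the patch: bytes_increment, bytes_decrement, and KEY_LEN are hypothetical names, and a plain unsigned char buffer stands in for pg_uuid_t. The patch asserts that overflow never happens; this sketch instead reports it through the return value.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define KEY_LEN 16				/* assumption: same width as a UUID */

/*
 * Advance a fixed-width, big-endian byte string to its successor.
 * Returns false when the key was already all 0xFF bytes (it wraps to zeroes).
 */
static bool
bytes_increment(unsigned char *key)
{
	for (int i = KEY_LEN - 1; i >= 0; i--)
	{
		if (key[i] < UCHAR_MAX)
		{
			key[i]++;
			return true;
		}
		key[i] = 0;				/* carry into the next more significant byte */
	}
	return false;
}

/*
 * Step a fixed-width, big-endian byte string back to its predecessor.
 * Returns false when the key was already all zero bytes (it wraps to 0xFF).
 */
static bool
bytes_decrement(unsigned char *key)
{
	for (int i = KEY_LEN - 1; i >= 0; i--)
	{
		if (key[i] > 0)
		{
			key[i]--;
			return true;
		}
		key[i] = UCHAR_MAX;		/* borrow from the next more significant byte */
	}
	return false;
}
```

Working from the least significant byte keeps the result consistent with a memcmp-style ordering of the whole buffer, which is what a skip array needs when it generates the next candidate value.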
Hi Peter,
Attached is a POC patch that adds skip scan to nbtree. The patch
teaches nbtree index scans to efficiently use a composite index on
'(a, b)' for queries with a predicate such as "WHERE b = 5". This is
feasible in cases where the total number of distinct values in the
column 'a' is reasonably small (think tens or hundreds, perhaps even
thousands for very large composite indexes).[...]
Thoughts?
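The "one subindex per distinct value of 'a'" idea quoted above can be illustrated with a toy model. This is a rough sketch only, assuming a sorted in-memory array in place of a B-Tree; Entry, lower_bound, and skip_scan_count are hypothetical names and do not correspond to anything in nbtree. Each binary search plays the role of one primitive index scan.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for an index tuple on (a, b) */
typedef struct
{
	int			a;
	int			b;
} Entry;

/* First position in [lo, n) whose key is >= (a, b) */
static size_t
lower_bound(const Entry *idx, size_t n, size_t lo, int a, int b)
{
	size_t		hi = n;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (idx[mid].a < a || (idx[mid].a == a && idx[mid].b < b))
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Count entries with b == target in an array sorted by (a, b), probing once
 * per distinct 'a' rather than reading every entry.  *probes receives the
 * number of binary searches performed.
 */
static size_t
skip_scan_count(const Entry *idx, size_t n, int target, size_t *probes)
{
	size_t		pos = 0;
	size_t		matches = 0;

	*probes = 0;
	while (pos < n)
	{
		int			a = idx[pos].a; /* next distinct 'a' value */

		/* Probe this "subindex" for (a, target) */
		pos = lower_bound(idx, n, pos, a, target);
		(*probes)++;
		while (pos < n && idx[pos].a == a && idx[pos].b == target)
		{
			matches++;
			pos++;
		}

		/* Still inside this 'a' group?  Skip ahead to the next group. */
		if (pos < n && idx[pos].a == a)
		{
			if (a == INT_MAX)
				break;			/* no larger 'a' value can exist */
			pos = lower_bound(idx, n, pos, a + 1, INT_MIN);
			(*probes)++;
		}
	}
	return matches;
}
```

The number of probes grows with the number of distinct 'a' values, not with the number of entries, which is why the approach only pays off when 'a' has relatively few distinct values.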
Many thanks for working on this. I believe it is an important feature
and it would be great to deliver it during the PG18 cycle.
I experimented with the patch and here are the results I got so far.
Firstly, it was compiled on Intel macOS and ARM Linux. All the tests
pass just fine.
Secondly, I tested the patch manually using a release build on my
Raspberry Pi 5 and the GUCs that can be seen in [1].
Test 1 - simple one.
```
CREATE TABLE test1(c char, n bigint);
CREATE INDEX test1_idx ON test1 USING btree(c,n);
INSERT INTO test1
SELECT chr(ascii('a') + random(0,2)) AS c,
random(0, 1_000_000_000) AS n
FROM generate_series(0, 1_000_000);
EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000;
```
Test 2 - a more complicated one.
```
CREATE TABLE test2(c1 char, c2 char, n bigint);
CREATE INDEX test2_idx ON test2 USING btree(c1,c2,n);
INSERT INTO test2
SELECT chr(ascii('a') + random(0,2)) AS c1,
chr(ascii('a') + random(0,2)) AS c2,
random(0, 1_000_000_000) AS n
FROM generate_series(0, 1_000_000);
EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test2 WHERE n > 900_000_000;
```
Test 3 - to see how it works with covering indexes.
```
CREATE TABLE test3(c char, n bigint, s text);
CREATE INDEX test3_idx ON test3 USING btree(c,n) INCLUDE(s);
INSERT INTO test3
SELECT chr(ascii('a') + random(0,2)) AS c,
random(0, 1_000_000_000) AS n,
'text_value_' || random(0, 1_000_000_000) AS s
FROM generate_series(0, 1_000_000);
EXPLAIN [ANALYZE] SELECT s FROM test3 WHERE n < 1000;
```
In all the cases the patch worked as expected.
I noticed that with the patch we choose Index Only Scans for Test 1
and without the patch - Parallel Seq Scan. However the Parallel Seq
Scan is 2.4 times faster. Before the patch the query takes 53 ms,
after the patch - 127 ms. I realize this could be just something
specific to my hardware and/or amount of data.
Do you think this is something that was expected or something worth
investigating further?
I haven't looked at the code yet.
[1]: https://github.com/afiskon/pgscripts/blob/master/single-install-meson.sh
--
Best regards,
Aleksander Alekseev
On Tue, Jul 2, 2024 at 8:53 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:
CREATE TABLE test1(c char, n bigint);
CREATE INDEX test1_idx ON test1 USING btree(c,n);
The type "char" (note the quotes) is different from char(1). It just
so happens that v1 has support for skipping attributes that use the
default opclass for "char", without support for char(1).
If you change your table definition to CREATE TABLE test1(c "char", n
bigint), then your example queries can use the optimization. This
makes a huge difference.
EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000;
For example, this first test query goes from needing a full index scan
that has 5056 buffer hits to a skip scan that requires only 12 buffer
hits.
I noticed that with the patch we choose Index Only Scans for Test 1
and without the patch - Parallel Seq Scan. However the Parallel Seq
Scan is 2.4 times faster. Before the patch the query takes 53 ms,
after the patch - 127 ms.
I'm guessing that it's actually much faster once you change the
leading column to the "char" type/default opclass.
I realize this could be just something
specific to my hardware and/or amount of data.
The selfuncs.c costing currently has a number of problems.
One problem is that it doesn't know that some opclasses/types don't
support skipping at all. That particular problem should be fixed on
the nbtree side; nbtree should support skipping regardless of the
opclass that the skipped attribute uses (while still retaining the new
opclass support functions for a subset of types where we expect it to
make skip scans somewhat faster).
--
Peter Geoghegan
On Tue, Jul 2, 2024 at 9:30 AM Peter Geoghegan <pg@bowt.ie> wrote:
EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000;
For example, this first test query goes from needing a full index scan
that has 5056 buffer hits to a skip scan that requires only 12 buffer
hits.
Actually, looks like that's an invalid result. The "char" opclass
support function appears to have bugs.
My testing totally focussed on types like integer, date, and UUID. The
"char" opclass was somewhat of an afterthought. Will fix "char" skip
support for v2.
--
Peter Geoghegan
On Tue, Jul 2, 2024 at 9:40 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Jul 2, 2024 at 9:30 AM Peter Geoghegan <pg@bowt.ie> wrote:
EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000;
For example, this first test query goes from needing a full index scan
that has 5056 buffer hits to a skip scan that requires only 12 buffer
hits.
Actually, looks like that's an invalid result. The "char" opclass
support function appears to have bugs.
Attached v2 fixes this bug. The problem was that the skip support
function used by the "char" opclass assumed signed char comparisons,
even though the authoritative B-Tree comparator (support function 1)
uses unsigned comparisons (via uint8 casting). A simple oversight. Your
test cases will work with this v2, provided you use "char" (instead of
unadorned char) in the create table statements.
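To make the signed/unsigned mismatch concrete, here is a minimal standalone C sketch. It is not code from the patch: cmp_unsigned only mirrors the uint8-casting approach of the B-Tree comparator, while as_signed and cmp_signed are hypothetical helpers that model a routine which wrongly assumes signed ordering.

```c
#include <assert.h>
#include <stdint.h>

/* Order bytes the way the authoritative comparator does: as unsigned. */
static int
cmp_unsigned(uint8_t a, uint8_t b)
{
	return (int) a - (int) b;
}

/* Hypothetical helper: reinterpret a byte as a two's complement int8. */
static int
as_signed(uint8_t v)
{
	return v >= 128 ? (int) v - 256 : (int) v;
}

/* Hypothetical buggy ordering: treats the same bytes as signed. */
static int
cmp_signed(uint8_t a, uint8_t b)
{
	return as_signed(a) - as_signed(b);
}
```

The two orderings agree for 7-bit ASCII values but disagree for any byte with the high bit set: 0x80 sorts after 0x7F under cmp_unsigned and before it under cmp_signed. Any code that derives a "lowest", "highest" or "next" value from the signed view would therefore disagree with the order in which the index actually stores its tuples.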
Another small change in v2: I added a DEBUG2 message to nbtree
preprocessing, indicating the number of attributes that we're going to
skip. This provides an intuitive way to see whether the optimizations
are being applied in the first place. That should help to avoid
further confusion like this as the patch continues to evolve.
Support for char(1) doesn't seem feasible within the confines of a
skip support routine. Just like with text (which I touched on in the
introductory email), this will require teaching nbtree to perform
explicit next-key probes. An approach based on explicit probes is
somewhat less efficient in some cases, but it should always work. It's
impractical to write opclass support that (say) increments a char
value 'a' to 'b'. Making that approach work would require extensive
cooperation from the collation provider, and some knowledge of
encoding, which just doesn't make sense (if it's possible at all). I
don't have the problem with "char" because it isn't a collatable type
(it is essentially the same thing as a uint8 integer type, except
that it outputs printable ASCII characters).
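As a rough illustration of why a uint8-like type is easy to support, the following standalone C sketch enumerates the whole domain with a plain "+ 1" increment. The names ELEM_LOW, ELEM_HIGH, elem_increment, advance_elem and count_domain are hypothetical and do not appear in the patch; the sketch assumes the caller (standing in for the B-Tree code) checks for the highest element before it ever asks for an increment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ELEM_LOW	((uint8_t) 0)
#define ELEM_HIGH	((uint8_t) UINT8_MAX)

/* Opclass-style callback: only ever called with a value below ELEM_HIGH. */
static uint8_t
elem_increment(uint8_t existing)
{
	assert(existing < ELEM_HIGH);
	return (uint8_t) (existing + 1);
}

/* Caller side: step to the next value, or report that none is left. */
static bool
advance_elem(uint8_t *cur)
{
	if (*cur == ELEM_HIGH)
		return false;			/* domain exhausted, no overflow */
	*cur = elem_increment(*cur);
	return true;
}

/* Walk the domain from ELEM_LOW upwards and count the values visited. */
static int
count_domain(void)
{
	uint8_t		cur = ELEM_LOW;
	int			n = 1;

	while (advance_elem(&cur))
		n++;
	return n;
}
```

No collation or encoding knowledge is needed here, because the successor of a byte value is always well defined. A collatable type such as char(1) or text has no comparably cheap successor function, which is the reason given above for needing explicit next-key probes.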
FWIW, your test cases don't seem like particularly good showcases for
the patch. The queries you came up with require a relatively large
amount of random I/O when accessing the heap, which skip scan will
never help with -- so skip scan is a small win (at least relative to
an unoptimized full index scan). Obviously, no skip scan can ever
avoid any required heap accesses compared to a naive full index scan
(loose index scan *does* have that capability, which is possible only
because it applies semantic information in a way that's very
different).
FWIW, a more sympathetic version of your test queries would have
involved something like "WHERE n = 900_500_000". That would allow the
implementation to perform a series of *selective* primitive index
scans (one primitive index scan per "c" column/char grouping). That
change has the effect of allowing the scan to skip over many
irrelevant leaf pages, which is of course the whole point of skip
scan. It also makes the scan require far fewer heap accesses, so
heap related costs no longer drown out the nbtree improvements.
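The "one selective primitive index scan per grouping" idea can be sketched outside of Postgres. The standalone C model below treats a sorted array of (c, n) pairs as the index and counts how many binary-search "descents" are needed to find every entry with n equal to a target. Entry, entry_cmp, lower_bound and skip_scan_count are hypothetical names for illustration only; this is a simplified model under those assumptions, not the nbtree algorithm.

```c
#include <limits.h>
#include <stddef.h>

typedef struct Entry
{
	unsigned char c;			/* leading (skipped) column */
	long		n;				/* second column, has the equality qual */
} Entry;

/* Compare an index entry against the search key (c, n). */
static int
entry_cmp(const Entry *e, unsigned char c, long n)
{
	if (e->c != c)
		return e->c < c ? -1 : 1;
	if (e->n != n)
		return e->n < n ? -1 : 1;
	return 0;
}

/* First position whose entry is >= (c, n); models one index descent. */
static size_t
lower_bound(const Entry *idx, size_t len, unsigned char c, long n)
{
	size_t		lo = 0;
	size_t		hi = len;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (entry_cmp(&idx[mid], c, n) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Count entries with n == target, repositioning once per 'c' value tried. */
static size_t
skip_scan_count(const Entry *idx, size_t len, long target, size_t *descents)
{
	size_t		matches = 0;
	unsigned char cur = 0;		/* lowest possible 'c' value */

	*descents = 0;
	for (;;)
	{
		size_t		i = lower_bound(idx, len, cur, target);

		(*descents)++;
		while (i < len && idx[i].c == cur && idx[i].n == target)
		{
			matches++;
			i++;
		}
		if (i == len)
			break;				/* ran off the end of the index */
		if (idx[i].c > cur)
			cur = idx[i].c;		/* the index revealed the next 'c' value */
		else if (cur < UCHAR_MAX)
			cur++;				/* optimistically try the next value */
		else
			break;				/* no possible 'c' values remain */
	}
	return matches;
}
```

With three distinct c values and ten n values each, a lookup for a single n needs only a handful of descents instead of reading all thirty entries. The gap widens as each group grows, whereas a range qual such as "n > 900_000_000" still has to read every qualifying entry within each group.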
--
Peter Geoghegan
Attachments:
v2-0001-Add-skip-scan-to-nbtree.patch (application/octet-stream)
From d41c1da841e4ab6245caff02d17b945f8346b47b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 16 Apr 2024 13:21:36 -0400
Subject: [PATCH v2] Add skip scan to nbtree.
Skip scan allows nbtree index scans to efficiently use a composite index
on (a, b) for queries with a predicate such as "WHERE b = 5".
This is useful in cases where the total number of distinct values in the
column 'a' is reasonably small (think hundreds, possibly thousands).
In effect, a skip scan treats the composite index on (a, b) as if it was
a series of disjunct subindexes -- one subindex per distinct 'a' value.
We exhaustively "search every subindex" using a qual that behaves like
"WHERE a = ANY(<every possible 'a' value>) AND b = 5".
---
src/include/access/nbtree.h | 16 +-
src/include/catalog/pg_amproc.dat | 16 +
src/include/catalog/pg_proc.dat | 24 +
src/include/utils/skipsupport.h | 140 ++
src/backend/access/nbtree/nbtcompare.c | 201 +++
src/backend/access/nbtree/nbtree.c | 10 +-
src/backend/access/nbtree/nbtutils.c | 1399 ++++++++++++++++---
src/backend/access/nbtree/nbtvalidate.c | 4 +
src/backend/commands/opclasscmds.c | 25 +
src/backend/utils/adt/Makefile | 1 +
src/backend/utils/adt/date.c | 34 +
src/backend/utils/adt/meson.build | 1 +
src/backend/utils/adt/selfuncs.c | 30 +-
src/backend/utils/adt/skipsupport.c | 54 +
src/backend/utils/adt/uuid.c | 65 +
src/backend/utils/misc/guc_tables.c | 12 +
doc/src/sgml/btree.sgml | 13 +
doc/src/sgml/xindex.sgml | 16 +-
src/test/regress/expected/alter_generic.out | 6 +-
src/test/regress/expected/psql.out | 3 +-
src/test/regress/sql/alter_generic.sql | 2 +-
src/tools/pgindent/typedefs.list | 3 +
22 files changed, 1899 insertions(+), 176 deletions(-)
create mode 100644 src/include/utils/skipsupport.h
create mode 100644 src/backend/utils/adt/skipsupport.c
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 749304334..81e99fcc1 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -24,6 +24,7 @@
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/shm_toc.h"
+#include "utils/skipsupport.h"
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
typedef uint16 BTCycleId;
@@ -709,7 +710,8 @@ BTreeTupleGetMaxHeapTID(IndexTuple itup)
#define BTINRANGE_PROC 3
#define BTEQUALIMAGE_PROC 4
#define BTOPTIONS_PROC 5
-#define BTNProcs 5
+#define BTSKIPSUPPORT_PROC 6
+#define BTNProcs 6
/*
* We need to be able to tell the difference between read and write
@@ -1032,9 +1034,15 @@ typedef BTScanPosData *BTScanPos;
typedef struct BTArrayKeyInfo
{
int scan_key; /* index of associated key in keyData */
+ int num_elems; /* number of elems (-1 for skip array) */
+
+ /* State used by standard arrays that store elements in memory */
int cur_elem; /* index of current element in elem_values */
- int num_elems; /* number of elems in current array value */
Datum *elem_values; /* array of num_elems Datums */
+
+ /* State used by skip arrays, which generate elements procedurally */
+ SkipSupportData sksup; /* opclass skip scan support */
+ bool null_elem; /* lowest/highest element actually NULL? */
} BTArrayKeyInfo;
typedef struct BTScanOpaqueData
@@ -1123,6 +1131,7 @@ typedef struct BTReadPageState
*/
#define SK_BT_REQFWD 0x00010000 /* required to continue forward scan */
#define SK_BT_REQBKWD 0x00020000 /* required to continue backward scan */
+#define SK_BT_SKIP 0x00040000 /* SK_SEARCHARRAY skip scan key */
#define SK_BT_INDOPTION_SHIFT 24 /* must clear the above bits */
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
@@ -1159,6 +1168,9 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+/* GUC parameter (just a temporary convenience for reviewers) */
+extern PGDLLIMPORT int skipscan_prefix_cols;
+
/*
* external entry points for btree, in nbtree.c
*/
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index f639c3a6a..2a8f6f3f1 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -21,6 +21,8 @@
amprocrighttype => 'bit', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
amprocrighttype => 'bool', amprocnum => '1', amproc => 'btboolcmp' },
+{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
+ amprocrighttype => 'bool', amprocnum => '6', amproc => 'btboolskipsupport' },
{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
amprocrighttype => 'bool', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/bpchar_ops', amproclefttype => 'bpchar',
@@ -41,12 +43,16 @@
amprocrighttype => 'char', amprocnum => '1', amproc => 'btcharcmp' },
{ amprocfamily => 'btree/char_ops', amproclefttype => 'char',
amprocrighttype => 'char', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/char_ops', amproclefttype => 'char',
+ amprocrighttype => 'char', amprocnum => '6', amproc => 'btcharskipsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '1', amproc => 'date_cmp' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '2', amproc => 'date_sortsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
+ amprocrighttype => 'date', amprocnum => '6', amproc => 'date_skipsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'timestamp', amprocnum => '1',
amproc => 'date_cmp_timestamp' },
@@ -122,6 +128,8 @@
amprocrighttype => 'int2', amprocnum => '2', amproc => 'btint2sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
+ amprocrighttype => 'int2', amprocnum => '6', amproc => 'btint2skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint24cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
@@ -141,6 +149,8 @@
amprocrighttype => 'int4', amprocnum => '2', amproc => 'btint4sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
+ amprocrighttype => 'int4', amprocnum => '6', amproc => 'btint4skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
amprocrighttype => 'int8', amprocnum => '1', amproc => 'btint48cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
@@ -160,6 +170,8 @@
amprocrighttype => 'int8', amprocnum => '2', amproc => 'btint8sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
amprocrighttype => 'int8', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
+ amprocrighttype => 'int8', amprocnum => '6', amproc => 'btint8skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint84cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
@@ -193,6 +205,8 @@
amprocrighttype => 'oid', amprocnum => '2', amproc => 'btoidsortsupport' },
{ amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
amprocrighttype => 'oid', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
+ amprocrighttype => 'oid', amprocnum => '6', amproc => 'btoidskipsupport' },
{ amprocfamily => 'btree/oidvector_ops', amproclefttype => 'oidvector',
amprocrighttype => 'oidvector', amprocnum => '1',
amproc => 'btoidvectorcmp' },
@@ -261,6 +275,8 @@
amprocrighttype => 'uuid', amprocnum => '2', amproc => 'uuid_sortsupport' },
{ amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
amprocrighttype => 'uuid', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
+ amprocrighttype => 'uuid', amprocnum => '6', amproc => 'uuid_skipsupport' },
{ amprocfamily => 'btree/record_ops', amproclefttype => 'record',
amprocrighttype => 'record', amprocnum => '1', amproc => 'btrecordcmp' },
{ amprocfamily => 'btree/record_image_ops', amproclefttype => 'record',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d4ac578ae..d02dd1a0c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1004,18 +1004,27 @@
{ oid => '3129', descr => 'sort support',
proname => 'btint2sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint2sortsupport' },
+{ oid => '9290', descr => 'skip support',
+ proname => 'btint2skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint2skipsupport' },
{ oid => '351', descr => 'less-equal-greater',
proname => 'btint4cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'int4 int4', prosrc => 'btint4cmp' },
{ oid => '3130', descr => 'sort support',
proname => 'btint4sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint4sortsupport' },
+{ oid => '9291', descr => 'skip support',
+ proname => 'btint4skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint4skipsupport' },
{ oid => '842', descr => 'less-equal-greater',
proname => 'btint8cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'int8 int8', prosrc => 'btint8cmp' },
{ oid => '3131', descr => 'sort support',
proname => 'btint8sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint8sortsupport' },
+{ oid => '9292', descr => 'skip support',
+ proname => 'btint8skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint8skipsupport' },
{ oid => '354', descr => 'less-equal-greater',
proname => 'btfloat4cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'float4 float4', prosrc => 'btfloat4cmp' },
@@ -1034,12 +1043,18 @@
{ oid => '3134', descr => 'sort support',
proname => 'btoidsortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btoidsortsupport' },
+{ oid => '9293', descr => 'skip support',
+ proname => 'btoidskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btoidskipsupport' },
{ oid => '404', descr => 'less-equal-greater',
proname => 'btoidvectorcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'oidvector oidvector', prosrc => 'btoidvectorcmp' },
{ oid => '358', descr => 'less-equal-greater',
proname => 'btcharcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'char char', prosrc => 'btcharcmp' },
+{ oid => '9294', descr => 'skip support',
+ proname => 'btcharskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btcharskipsupport' },
{ oid => '359', descr => 'less-equal-greater',
proname => 'btnamecmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'name name', prosrc => 'btnamecmp' },
@@ -2214,6 +2229,9 @@
{ oid => '3136', descr => 'sort support',
proname => 'date_sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'date_sortsupport' },
+{ oid => '9295', descr => 'skip support',
+ proname => 'date_skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'date_skipsupport' },
{ oid => '4133', descr => 'window RANGE support',
proname => 'in_range', prorettype => 'bool',
proargtypes => 'date date interval bool bool',
@@ -4368,6 +4386,9 @@
{ oid => '1693', descr => 'less-equal-greater',
proname => 'btboolcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'bool bool', prosrc => 'btboolcmp' },
+{ oid => '9296', descr => 'skip support',
+ proname => 'btboolskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btboolskipsupport' },
{ oid => '1688', descr => 'hash',
proname => 'time_hash', prorettype => 'int4', proargtypes => 'time',
@@ -9175,6 +9196,9 @@
{ oid => '3300', descr => 'sort support',
proname => 'uuid_sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'uuid_sortsupport' },
+{ oid => '9297', descr => 'skip support',
+ proname => 'uuid_skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'uuid_skipsupport' },
{ oid => '2961', descr => 'I/O',
proname => 'uuid_recv', prorettype => 'uuid', proargtypes => 'internal',
prosrc => 'uuid_recv' },
diff --git a/src/include/utils/skipsupport.h b/src/include/utils/skipsupport.h
new file mode 100644
index 000000000..a71a624d0
--- /dev/null
+++ b/src/include/utils/skipsupport.h
@@ -0,0 +1,140 @@
+/*-------------------------------------------------------------------------
+ *
+ * skipsupport.h
+ * Support routines for B-Tree skip scans.
+ *
+ * B-Tree operator classes for discrete types can optionally provide a support
+ * function for skipping. This is used during skip scans.
+ *
+ * A B-tree operator class that implements skip support provides B-tree index
+ * scans with a way of enumerating and iterating through every possible value
+ * from the domain of indexable values. This gives scans a way to determine
+ * the next value in line for a given skip array/scan key/skipped attribute.
+ * This happens at the point where the scan determines that another primitive
+ * index scan is required. The next value is used (in combination with at
+ * least one additional lower-order non-skip key, taken from the SQL query) to
+ * relocate the scan, skipping over many irrelevant leaf pages in the process.
+ *
+ * There are many data types/opclasses where implementing a skip support
+ * scheme is inherently impossible (or at least impractical). Obviously, it
+ * would be wrong if the "next" value generated by an opclass was actually
+ * after the true next value (any index tuples with the true next value would
+ * be overlooked by the index scan). This partly explains why opclasses are
+ * under no obligation to implement skip support: a continuous type may have
+ * no way of generating a useful next value.
+ *
+ * Skip scan generally works best with discrete types such as integer, date,
+ * and boolean: types where we expect indexes to contain large groups of
+ * contiguous values (in respect of the leading/skipped index attribute).
+ * When gaps/discontinuities are naturally rare (e.g., a leading identity
+ * column in a composite index, a date column preceding a product_id column),
+ * then it makes sense for the skip scan to optimistically assume that the
+ * next distinct indexable value will find directly matching index tuples.
+ * The B-Tree code can fall back on explicit next-key probes for any opclass
+ * that doesn't include a skip support function, but it's best to provide skip
+ * support whenever possible. The B-Tree code assumes that it's always better
+ * to use the opclass skip support routine where available.
+ *
+ * When a skip scan "bets" that the next indexable value will find an exact
+ * match, there is significant upside, without any accompanying downside.
+ * When this optimistic strategy works out, the scan avoids the cost of an
+ * explicit probe (used in the no-skip-support case to determine the true next
+ * value in the index's skip attribute). When the strategy doesn't work out,
+ * then the scan is no worse off than it would have been without skip support.
+ * The explicit next-key probes used by B-Tree skip scan's fallback path are
+ * very similar to "failed" optimistic searches for the next indexable value
+ * (the next value according to the opclass skip support routine).
+ *
+ * (FIXME Actually, nbtree does no such thing right now, which is considered a
+ * blocker to commit.)
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/skipsupport.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SKIPSUPPORT_H
+#define SKIPSUPPORT_H
+
+#include "utils/relcache.h"
+
+typedef struct SkipSupportData *SkipSupport;
+
+/*
+ * State/callbacks used by skip arrays to procedurally generate elements.
+ *
+ * A BTSKIPSUPPORT_PROC function must set each and every field when called.
+ * If an opclass can only set some of the fields, then it cannot safely
+ * provide a skip support routine (and so must rely on the fallback strategy
+ * used by continuous types, such as numeric).
+ */
+typedef struct SkipSupportData
+{
+ /*
+ * low_elem and high_elem must be set with the lowest and highest possible
+ * values from the domain of indexable values (assuming standard ascending
+ * order). This helps the B-Tree code with finding its initial position
+ * at the leaf level (during the skip scan's first primitive index scan).
+ * In other words, it gives the B-Tree code a useful value to start from,
+ * before any data has been read from the index.
+ *
+ * low_elem and high_elem can also be used to prove that a qual is
+ * unsatisfiable in certain cross-type scenarios.
+ *
+ * low_elem and high_elem are also used by skip scans to determine when
+ * they've reached the final possible value (in the current direction).
+ * It's typical for the scan to run out of leaf pages before it runs out
+ * of unscanned indexable values, but it's still useful for the scan to
+ * have a way to recognize when it has reached the last possible value
+ * (this saves us a useless probe that just lands on the final leaf page).
+ *
+ * Note: the logic for determining that the scan has reached the final
+ * possible value naturally belongs in the B-Tree code. The final value
+ * isn't necessarily the original high_elem/low_elem set by the opclass.
+ * In particular, it'll be a lower/higher value when B-Tree preprocessing
+ * determines that the true range of possible values should be restricted,
+ * due to the presence of an inequality applied to the index's skipped
+ * attribute. These are range skip scans.
+ */
+ Datum low_elem; /* lowest sorting/leftmost non-NULL value */
+ Datum high_elem; /* highest sorting/rightmost non-NULL value */
+
+ /*
+ * Decrement/increment functions.
+ *
+ * Returns a decremented/incremented copy of caller's existing datum,
+ * allocated in caller's memory context (in the case of pass-by-reference
+ * types). It's not okay for these functions to leak any memory.
+ *
+ * Both decrement and increment callbacks are guaranteed to never be
+ * called with a NULL "existing" arg. (In general it is the B-Tree code's
+ * job to worry about NULLs, and about whether indexed values are stored
+ * in ASC order or DESC order.)
+ *
+ * The decrement callback is guaranteed to only be called with an
+ * "existing" value that's strictly > the low_elem set by the opclass.
+ * Similarly, the increment callback is guaranteed to only be called with
+ * an "existing" value that's strictly < the high_elem set by the opclass.
+ * Consequently, opclasses don't have to deal with "overflow" themselves
+ * (though asserting that the B-Tree code got it right is a good idea).
+ *
+ * It's quite possible (and very common) for the B-Tree skip scan caller's
+ * "existing" datum to just be a straight copy of a value that it copied
+ * from the index. Operator classes must be liberal in accepting every
+ * possible representational variation within the underlying data type.
+ * Opclasses don't have to preserve whatever semantically insignificant
+ * information the data type might be carrying around, though.
+ *
+ * Note: < and > are defined by the opclass's ORDER proc in the usual way.
+ */
+ Datum (*decrement) (Relation rel, Datum existing);
+ Datum (*increment) (Relation rel, Datum existing);
+} SkipSupportData;
+
+extern bool PrepareSkipSupportFromOpclass(Oid opfamily, Oid opcintype,
+ bool reverse, SkipSupport sksup);
+
+#endif /* SKIPSUPPORT_H */
diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c
index 1c72867c8..48a877613 100644
--- a/src/backend/access/nbtree/nbtcompare.c
+++ b/src/backend/access/nbtree/nbtcompare.c
@@ -58,6 +58,7 @@
#include <limits.h>
#include "utils/fmgrprotos.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
#ifdef STRESS_SORT_INT_MIN
@@ -78,6 +79,39 @@ btboolcmp(PG_FUNCTION_ARGS)
PG_RETURN_INT32((int32) a - (int32) b);
}
+static Datum
+bool_decrement(Relation rel, Datum existing)
+{
+ bool bexisting = DatumGetBool(existing);
+
+ Assert(bexisting == true);
+
+ return BoolGetDatum(bexisting - 1);
+}
+
+static Datum
+bool_increment(Relation rel, Datum existing)
+{
+ bool bexisting = DatumGetBool(existing);
+
+ Assert(bexisting == false);
+
+ return BoolGetDatum(bexisting + 1);
+}
+
+Datum
+btboolskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = bool_decrement;
+ sksup->increment = bool_increment;
+ sksup->low_elem = BoolGetDatum(false);
+ sksup->high_elem = BoolGetDatum(true);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint2cmp(PG_FUNCTION_ARGS)
{
@@ -105,6 +139,39 @@ btint2sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int2_decrement(Relation rel, Datum existing)
+{
+ int16 iexisting = DatumGetInt16(existing);
+
+ Assert(iexisting > PG_INT16_MIN);
+
+ return Int16GetDatum(iexisting - 1);
+}
+
+static Datum
+int2_increment(Relation rel, Datum existing)
+{
+ int16 iexisting = DatumGetInt16(existing);
+
+ Assert(iexisting < PG_INT16_MAX);
+
+ return Int16GetDatum(iexisting + 1);
+}
+
+Datum
+btint2skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int2_decrement;
+ sksup->increment = int2_increment;
+ sksup->low_elem = Int16GetDatum(PG_INT16_MIN);
+ sksup->high_elem = Int16GetDatum(PG_INT16_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint4cmp(PG_FUNCTION_ARGS)
{
@@ -128,6 +195,39 @@ btint4sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int4_decrement(Relation rel, Datum existing)
+{
+ int32 iexisting = DatumGetInt32(existing);
+
+ Assert(iexisting > PG_INT32_MIN);
+
+ return Int32GetDatum(iexisting - 1);
+}
+
+static Datum
+int4_increment(Relation rel, Datum existing)
+{
+ int32 iexisting = DatumGetInt32(existing);
+
+ Assert(iexisting < PG_INT32_MAX);
+
+ return Int32GetDatum(iexisting + 1);
+}
+
+Datum
+btint4skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int4_decrement;
+ sksup->increment = int4_increment;
+ sksup->low_elem = Int32GetDatum(PG_INT32_MIN);
+ sksup->high_elem = Int32GetDatum(PG_INT32_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint8cmp(PG_FUNCTION_ARGS)
{
@@ -171,6 +271,39 @@ btint8sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int8_decrement(Relation rel, Datum existing)
+{
+ int64 iexisting = DatumGetInt64(existing);
+
+ Assert(iexisting > PG_INT64_MIN);
+
+ return Int64GetDatum(iexisting - 1);
+}
+
+static Datum
+int8_increment(Relation rel, Datum existing)
+{
+ int64 iexisting = DatumGetInt64(existing);
+
+ Assert(iexisting < PG_INT64_MAX);
+
+ return Int64GetDatum(iexisting + 1);
+}
+
+Datum
+btint8skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int8_decrement;
+ sksup->increment = int8_increment;
+ sksup->low_elem = Int64GetDatum(PG_INT64_MIN);
+ sksup->high_elem = Int64GetDatum(PG_INT64_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint48cmp(PG_FUNCTION_ARGS)
{
@@ -292,6 +425,39 @@ btoidsortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+oid_decrement(Relation rel, Datum existing)
+{
+ Oid oexisting = DatumGetObjectId(existing);
+
+ Assert(oexisting > InvalidOid);
+
+ return ObjectIdGetDatum(oexisting - 1);
+}
+
+static Datum
+oid_increment(Relation rel, Datum existing)
+{
+ Oid oexisting = DatumGetObjectId(existing);
+
+ Assert(oexisting < OID_MAX);
+
+ return ObjectIdGetDatum(oexisting + 1);
+}
+
+Datum
+btoidskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = oid_decrement;
+ sksup->increment = oid_increment;
+ sksup->low_elem = ObjectIdGetDatum(InvalidOid);
+ sksup->high_elem = ObjectIdGetDatum(OID_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btoidvectorcmp(PG_FUNCTION_ARGS)
{
@@ -325,3 +491,38 @@ btcharcmp(PG_FUNCTION_ARGS)
/* Be careful to compare chars as unsigned */
PG_RETURN_INT32((int32) ((uint8) a) - (int32) ((uint8) b));
}
+
+static Datum
+char_decrement(Relation rel, Datum existing)
+{
+ uint8 cexisting = DatumGetUInt8(existing);
+
+ Assert(cexisting > 0);
+
+ return CharGetDatum((uint8) cexisting - 1);
+}
+
+static Datum
+char_increment(Relation rel, Datum existing)
+{
+ uint8 cexisting = DatumGetUInt8(existing);
+
+ Assert(cexisting < UCHAR_MAX);
+
+ return CharGetDatum((uint8) cexisting + 1);
+}
+
+Datum
+btcharskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = char_decrement;
+ sksup->increment = char_increment;
+
+ /* btcharcmp compares chars as unsigned */
+ sksup->low_elem = UInt8GetDatum(0);
+ sksup->high_elem = UInt8GetDatum(UCHAR_MAX);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 686a3206f..9c9cd48f7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -324,11 +324,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
BTScanPosInvalidate(so->currPos);
BTScanPosInvalidate(so->markPos);
- if (scan->numberOfKeys > 0)
- so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
- else
- so->keyData = NULL;
+ so->keyData = NULL;
so->needPrimScan = false;
so->scanBehind = false;
so->arrayKeys = NULL;
@@ -408,6 +405,11 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
scan->numberOfKeys * sizeof(ScanKeyData));
so->numberOfKeys = 0; /* until _bt_preprocess_keys sets it */
so->numArrayKeys = 0; /* ditto */
+
+ /* Release private storage allocated in previous btrescan, if any */
+ if (so->keyData != NULL)
+ pfree(so->keyData);
+ so->keyData = NULL;
}
/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index d6de2072d..f4442d014 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -28,10 +28,44 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/skipsupport.h"
+
+/*
+ * GUC parameter (temporary convenience for reviewers).
+ *
+ * To disable all skipping, set skipscan_prefix_cols=0. Otherwise set it to
+ * the attribute number that you wish to make the last attribute number that
+ * we can add a skip scan key for.
+ *
+ * For example, setting skipscan_prefix_cols=1 before an index scan with qual
+ * "WHERE b = 1 AND c > 42" will make us generate a skip scan key on the
+ * column 'a' (which is attnum 1) only, preventing us from adding one for the
+ * column 'c' (and so 'c' will still have an inequality scan key, required in
+ * only one direction -- 'c' won't be output as a "range" skip key/array).
+ *
+ * The same scan keys will be output when skipscan_prefix_cols=2, given the
+ * same query/qual, since we naturally get a required equality scan key on 'b'
+ * from the input scan keys (provided we at least manage to add a skip scan
+ * key on 'a' that "anchors its required-ness" to the 'b' scan key.)
+ *
+ * When skipscan_prefix_cols is set to the number of key columns in the index,
+ * we're as aggressive as possible about adding skip scan arrays/scan keys.
+ * This is the current default behavior, and the behavior we're targeting for
+ * the committed patch (if there are slowdowns from being maximally aggressive
+ * here then the likely solution is to make _bt_advance_array_keys adaptive,
+ * rather than trying to predict what will work during preprocessing).
+ */
+int skipscan_prefix_cols;
#define LOOK_AHEAD_REQUIRED_RECHECKS 3
#define LOOK_AHEAD_DEFAULT_DISTANCE 5
+typedef struct BTSkipPreproc
+{
+ SkipSupportData sksup; /* opclass skip scan support */
+ Oid eq_op; /* InvalidOid means don't skip */
+} BTSkipPreproc;
+
typedef struct BTSortArrayContext
{
FmgrInfo *sortproc;
@@ -62,18 +96,49 @@ static bool _bt_compare_array_scankey_args(IndexScanDesc scan,
ScanKey arraysk, ScanKey skey,
FmgrInfo *orderproc, BTArrayKeyInfo *array,
bool *qual_ok);
-static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan);
+static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys);
static void _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap);
+static int _bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts);
+static bool _bt_skip_support(Relation rel, int add_skip_attno,
+ BTSkipPreproc *skipatts);
+static inline Datum _bt_apply_decrement(Relation rel, ScanKey skey,
+ BTArrayKeyInfo *array);
+static inline Datum _bt_apply_increment(Relation rel, ScanKey skey,
+ BTArrayKeyInfo *array);
static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
static inline int32 _bt_compare_array_skey(FmgrInfo *orderproc,
Datum tupdatum, bool tupnull,
- Datum arrdatum, ScanKey cur);
+ Datum arrdatum, bool arrnull,
+ ScanKey cur);
+static void _bt_apply_compare_array(ScanKey arraysk, ScanKey skey,
+ FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok);
+static void _bt_apply_compare_skiparray(IndexScanDesc scan, ScanKey arraysk,
+ ScanKey skey, FmgrInfo *orderproc,
+ FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok);
static int _bt_binsrch_array_skey(FmgrInfo *orderproc,
bool cur_elem_trig, ScanDirection dir,
Datum tupdatum, bool tupnull,
BTArrayKeyInfo *array, ScanKey cur,
int32 *set_elem_result);
+static void _bt_binsrch_skiparray_skey(FmgrInfo *orderproc,
+ bool cur_elem_trig, ScanDirection dir,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *set_elem_result);
+static void _bt_scankey_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static void _bt_scankey_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static void _bt_scankey_set_low_or_high(Relation rel, ScanKey skey,
+ BTArrayKeyInfo *array, bool low_not_high);
+static void _bt_scankey_set_element(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ Datum tupdatum, bool tupnull);
+static void _bt_scankey_unset_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static void _bt_scankey_set_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_advance_skip_array_key_increment(Relation rel, ScanDirection dir,
+ BTArrayKeyInfo *array, ScanKey skey,
+ FmgrInfo *orderproc);
static void _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir);
static bool _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
IndexTuple tuple, TupleDesc tupdesc, int tupnatts,
@@ -251,9 +316,6 @@ _bt_freestack(BTStack stack)
* It is convenient for _bt_preprocess_keys caller to have to deal with no
* more than one equality strategy array scan key per index attribute. We'll
* always be able to set things up that way when complete opfamilies are used.
- * Eliminated array scan keys can be recognized as those that have had their
- * sk_strategy field set to InvalidStrategy here by us. Caller should avoid
- * including these in the scan's so->keyData[] output array.
*
* We set the scan key references from the scan's BTArrayKeyInfo info array to
* offsets into the temp modified input array returned to caller. Scans that
@@ -261,18 +323,36 @@ _bt_freestack(BTStack stack)
* preprocessing steps are complete. This will convert the scan key offset
* references into references to the scan's so->keyData[] output scan keys.
*
+ * We're also responsible for generating skip arrays (and their associated
+ * scan keys) here. This enables skip scan. We do this for index attributes
+ * that initially lacked an equality condition within scan->keyData[], iff
+ * doing so allows a later scan key (that was passed to us in scan->keyData[])
+ * to be marked required by later preprocessing on output.
+ * _bt_decide_skipatts decides which attributes receive skip arrays.
+ *
+ * Caller must pass *numberOfKeys to give us a way to change the number of
+ * input scan keys (our output is caller's input). The returned array can be
+ * smaller than scan->keyData[] when we eliminated a redundant array scan key
+ * (redundant with some other array scan key, for the same attribute). It can
+ * also be larger when we added a skip array/skip scan key. Caller uses this
+ * to allocate so->keyData[] for the current btrescan.
+ *
* Note: the reason we need to return a temp scan key array, rather than just
* scribbling on scan->keyData, is that callers are permitted to call btrescan
* without supplying a new set of scankey data.
*/
static ScanKey
-_bt_preprocess_array_keys(IndexScanDesc scan)
+_bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
- int numberOfKeys = scan->numberOfKeys;
+ int numArrayKeyData = scan->numberOfKeys;
int16 *indoption = rel->rd_indoption;
- int numArrayKeys;
+ BTSkipPreproc skipatts[INDEX_MAX_KEYS];
+ int numArrayKeys,
+ numSkipArrayKeys,
+ output_ikey = 0;
+ AttrNumber attno_skip = 1;
int origarrayatt = InvalidAttrNumber,
origarraykey = -1;
Oid origelemtype = InvalidOid;
@@ -280,11 +360,14 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
MemoryContext oldContext;
ScanKey arrayKeyData; /* modified copy of scan->keyData */
- Assert(numberOfKeys);
+ Assert(scan->numberOfKeys);
- /* Quick check to see if there are any array keys */
+ /*
+ * Quick check to see if there are any array keys, or any missing keys we
+ * can generate a "skip scan" array key for ourselves
+ */
numArrayKeys = 0;
- for (int i = 0; i < numberOfKeys; i++)
+ for (int i = 0; i < scan->numberOfKeys; i++)
{
cur = &scan->keyData[i];
if (cur->sk_flags & SK_SEARCHARRAY)
@@ -300,6 +383,18 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
}
}
+ /* Consider generating skip arrays, and associated equality scan keys */
+ numSkipArrayKeys = _bt_decide_skipatts(scan, skipatts);
+ if (numSkipArrayKeys)
+ {
+ /* At least one skip array scan key must be added to arrayKeyData[] */
+ numArrayKeys += numSkipArrayKeys;
+ /* output scan key buffer allocation needs space for skip scan keys */
+ numArrayKeyData += numSkipArrayKeys;
+
+ elog(DEBUG2, "skipping %d index attributes", numSkipArrayKeys);
+ }
+
/* Quit if nothing to do. */
if (numArrayKeys == 0)
return NULL;
@@ -317,19 +412,23 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
oldContext = MemoryContextSwitchTo(so->arrayContext);
- /* Create modifiable copy of scan->keyData in the workspace context */
- arrayKeyData = (ScanKey) palloc(numberOfKeys * sizeof(ScanKeyData));
- memcpy(arrayKeyData, scan->keyData, numberOfKeys * sizeof(ScanKeyData));
+ /* Create output scan keys in the workspace context */
+ arrayKeyData = (ScanKey) palloc(numArrayKeyData * sizeof(ScanKeyData));
/* Allocate space for per-array data in the workspace context */
so->arrayKeys = (BTArrayKeyInfo *) palloc(numArrayKeys * sizeof(BTArrayKeyInfo));
/* Allocate space for ORDER procs used to help _bt_checkkeys */
- so->orderProcs = (FmgrInfo *) palloc(numberOfKeys * sizeof(FmgrInfo));
+ so->orderProcs = (FmgrInfo *) palloc(numArrayKeyData * sizeof(FmgrInfo));
- /* Now process each array key */
+ /*
+ * Process each array key, and generate skip arrays as needed. Also copy
+ * every scan->keyData[] input scan key (whether it's an array or not)
+ * into the arrayKeyData array we'll return to our caller (barring any
+ * array scan keys that we could eliminate early through array merging).
+ */
numArrayKeys = 0;
- for (int i = 0; i < numberOfKeys; i++)
+ for (int input_ikey = 0; input_ikey < scan->numberOfKeys; input_ikey++)
{
FmgrInfo sortproc;
FmgrInfo *sortprocp = &sortproc;
@@ -345,14 +444,78 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
int num_nonnulls;
int j;
- cur = &arrayKeyData[i];
- if (!(cur->sk_flags & SK_SEARCHARRAY))
- continue;
+ /* Create a skip array and scan key where indicated by skipatts */
+ while (numSkipArrayKeys &&
+ attno_skip <= scan->keyData[input_ikey].sk_attno)
+ {
+ Oid opcintype = rel->rd_opcintype[attno_skip - 1];
+ Oid collation = rel->rd_indcollation[attno_skip - 1];
+ Oid eq_op = skipatts[attno_skip - 1].eq_op;
+ RegProcedure cmp_proc;
+
+ if (!OidIsValid(eq_op))
+ {
+ /* won't skip using this attribute */
+ attno_skip++;
+ continue;
+ }
+
+ cmp_proc = get_opcode(eq_op);
+ if (!RegProcedureIsValid(cmp_proc))
+ elog(ERROR, "missing oprcode for skipping equals operator %u", eq_op);
+
+ cur = &arrayKeyData[output_ikey];
+ Assert(attno_skip <= scan->keyData[input_ikey].sk_attno);
+ ScanKeyEntryInitialize(cur,
+ SK_SEARCHARRAY | SK_BT_SKIP, /* flags */
+ attno_skip, /* skipped att number */
+ BTEqualStrategyNumber, /* equality strategy */
+ InvalidOid, /* opclass input subtype */
+ collation, /* index column's collation */
+ cmp_proc, /* equality operator's proc */
+ (Datum) 0); /* constant */
+
+ /* Initialize array fields */
+ so->arrayKeys[numArrayKeys].scan_key = output_ikey;
+ so->arrayKeys[numArrayKeys].num_elems = -1;
+ so->arrayKeys[numArrayKeys].cur_elem = 0;
+ so->arrayKeys[numArrayKeys].elem_values = NULL; /* unused */
+ so->arrayKeys[numArrayKeys].sksup = skipatts[attno_skip - 1].sksup;
+ so->arrayKeys[numArrayKeys].null_elem = true; /* for now */
+
+ /*
+ * We'll need a 3-way ORDER proc to determine when and how the
+ * consed-up "array" will advance inside _bt_advance_array_keys.
+ * Set one up now.
+ */
+ _bt_setup_array_cmp(scan, cur, opcintype,
+ &so->orderProcs[output_ikey], NULL);
+
+ /*
+ * Prepare to output next scan key (might be another skip scan
+ * key, or it could be an input scan key from scan->keyData[])
+ */
+ numSkipArrayKeys--;
+ numArrayKeys++;
+ attno_skip++;
+ output_ikey++; /* keep this scan key/array */
+ }
/*
- * First, deconstruct the array into elements. Anything allocated
- * here (including a possibly detoasted array value) is in the
- * workspace context.
+ * Copy input scan key into temp arrayKeyData scan key array. (From
+ * here on, cur points at our copy of the input scan key.)
+ */
+ cur = &arrayKeyData[output_ikey];
+ *cur = scan->keyData[input_ikey];
+
+ if (!(cur->sk_flags & SK_SEARCHARRAY))
+ {
+ output_ikey++; /* keep this non-array scan key */
+ continue;
+ }
+
+ /*
+ * Deconstruct the array into elements
*/
arrayval = DatumGetArrayTypeP(cur->sk_argument);
/* We could cache this data, but not clear it's worth it */
@@ -406,6 +569,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
_bt_find_extreme_element(scan, cur, elemtype,
BTGreaterStrategyNumber,
elem_values, num_nonnulls);
+ output_ikey++; /* keep this transformed scan key */
continue;
case BTEqualStrategyNumber:
/* proceed with rest of loop */
@@ -416,6 +580,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
_bt_find_extreme_element(scan, cur, elemtype,
BTLessStrategyNumber,
elem_values, num_nonnulls);
+ output_ikey++; /* keep this transformed scan key */
continue;
default:
elog(ERROR, "unrecognized StrategyNumber: %d",
@@ -432,7 +597,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
* sortproc just points to the same proc used during binary searches.
*/
_bt_setup_array_cmp(scan, cur, elemtype,
- &so->orderProcs[i], &sortprocp);
+ &so->orderProcs[output_ikey], &sortprocp);
/*
* Sort the non-null elements and eliminate any duplicates. We must
@@ -476,11 +641,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
break;
}
- /*
- * Indicate to _bt_preprocess_keys caller that it must ignore
- * this scan key
- */
- cur->sk_strategy = InvalidStrategy;
+ /* Throw away this array */
continue;
}
@@ -511,12 +672,16 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
* Note: _bt_preprocess_array_keys_final will fix-up each array's
* scan_key field later on, after so->keyData[] has been finalized.
*/
- so->arrayKeys[numArrayKeys].scan_key = i;
+ so->arrayKeys[numArrayKeys].scan_key = output_ikey;
so->arrayKeys[numArrayKeys].num_elems = num_elems;
so->arrayKeys[numArrayKeys].elem_values = elem_values;
+ so->arrayKeys[numArrayKeys].null_elem = false; /* unused */
numArrayKeys++;
+ output_ikey++; /* keep this scan key/array */
}
+ /* Set final number of arrayKeyData[] keys, array keys */
+ *numberOfKeys = output_ikey;
so->numArrayKeys = numArrayKeys;
MemoryContextSwitchTo(oldContext);
@@ -624,7 +789,8 @@ _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap)
{
BTArrayKeyInfo *array = &so->arrayKeys[arrayidx];
- Assert(array->num_elems > 0);
+ Assert(array->num_elems > 0 || array->num_elems == -1);
+ Assert(array->num_elems != -1 || outkey->sk_flags & SK_BT_REQFWD);
if (array->scan_key == input_ikey)
{
@@ -685,6 +851,245 @@ _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap)
so->numArrayKeys, INDEX_MAX_KEYS)));
}
+/*
+ * _bt_decide_skipatts() -- set index attributes requiring skip arrays
+ *
+ * _bt_preprocess_array_keys helper function. Determines which attributes
+ * will require skip arrays/scan keys. Also sets up skip support function for
+ * each of these attributes.
+ *
+ * This sets up "skip scan". Adding skip arrays (and associated scan keys)
+ * allows _bt_preprocess_keys to mark lower-order scan keys (copied from the
+ * original scan->keyData[] array in the conventional way) as required. The
+ * overall effect is to enable skipping over irrelevant sections of the index.
+ *
+ * Return value is the total number of scan keys to add as "input" scan keys
+ * for further processing within _bt_preprocess_keys.
+ */
+static int
+_bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts)
+{
+ Relation rel = scan->indexRelation;
+ ScanKey inputsk;
+ AttrNumber attno_inputsk = 1,
+ attno_skip = 1;
+ bool attno_has_equal = false,
+ attno_has_rowcompare = false;
+ int numSkipArrayKeys = 0,
+ prev_numSkipArrayKeys = 0;
+
+ Assert(scan->numberOfKeys);
+
+ /*
+ * XXX Don't support system catalogs for now. Calls to routines like
+ * get_opfamily_member() are prone to infinite recursion, which we'll need
+ * to find a workaround for (hard-coded lookups?).
+ */
+ if (IsCatalogRelation(rel))
+ return 0;
+
+ /*
+ * FIXME Also don't support parallel scans for now. Must add logic to
+ * places like _bt_parallel_primscan_schedule so that we account for skip
+ * arrays when parallel workers serialize their array scan state.
+ */
+ if (scan->parallel_scan)
+ return 0;
+
+ inputsk = &scan->keyData[0];
+ for (int i = 0;; inputsk++, i++)
+ {
+ /*
+ * Backfill skip arrays for any wholly omitted attributes prior to
+ * attno_inputsk
+ */
+ while (attno_skip < attno_inputsk)
+ {
+ if (!_bt_skip_support(rel, attno_skip, &skipatts[attno_skip - 1]))
+ {
+ /*
+ * Opclass lacks a suitable skip support routine.
+ *
+ * Return prev_numSkipArrayKeys, so as to avoid including any
+ * "backfilled" arrays that were supposed to form a contiguous
+ * group with a skip array on this attribute. There is no
+ * benefit to adding backfill skip arrays unless we can do so
+ * for all attributes (all attributes up to and including the
+ * one immediately before attno_inputsk).
+ */
+ return prev_numSkipArrayKeys;
+ }
+
+ /* plan on adding a backfill skip array for this attribute */
+ numSkipArrayKeys++;
+ attno_skip++;
+ }
+
+ /*
+ * Stop once past the final input scan key. We deliberately never add
+ * a skip attribute for the attribute of the last input scan key.
+ *
+ * If the last input scan key(s) use equality strategy, then a skip
+ * attribute is superfluous at best. If the last input scan key uses
+ * an inequality strategy, then adding a skip scan array/scan key is a
+ * valid though suboptimal transformation. It is better to arrange
+ * for preprocessing to allow such an input inequality scan key to
+ * remain an inequality on output. That way _bt_checkkeys will be
+ * able to make best use of both of its precheck optimizations, but
+ * _bt_first will be no less capable of efficiently finding the
+ * starting position for each primitive index scan.
+ */
+ if (i >= scan->numberOfKeys)
+ break;
+
+ /*
+ * Cannot keep adding skip arrays after a RowCompare
+ */
+ if (attno_has_rowcompare)
+ break;
+
+ /*
+ * Apply temporary testing GUC that can be used to disable skipping
+ * (either in part or in whole)
+ */
+ if (attno_inputsk > skipscan_prefix_cols)
+ break;
+
+ /*
+ * Now consider next attno_inputsk (or keep going if this is an
+ * additional scan key against the same attribute)
+ */
+ if (attno_inputsk < inputsk->sk_attno)
+ {
+ prev_numSkipArrayKeys = numSkipArrayKeys;
+
+ /*
+ * Now add skip array for previous scan key's attribute, though
+ * only if the attribute has no equality strategy scan keys.
+ *
+ * Adding skip arrays to an attribute that has one or more
+ * inequality scan keys will cause preprocessing to output a range
+ * skip array. This will happen when preprocessing proper deals
+ * with the redundancy between the array and its inequalities.
+ */
+ skipatts[attno_skip - 1].eq_op = InvalidOid;
+ if (!attno_has_equal)
+ {
+ /* Only saw inequalities for the prior attribute */
+ if (_bt_skip_support(rel, attno_skip,
+ &skipatts[attno_skip - 1]))
+ {
+ /* add a range skip array for this attribute */
+ numSkipArrayKeys++;
+ }
+ else
+ break;
+ }
+ else
+ {
+ /*
+ * Saw an equality for the prior attribute, so it doesn't need
+ * a skip array (not even a range skip array). We'll be able
+ * to add later skip arrays, too (doesn't matter if the prior
+ * attribute uses an input opclass without skip support).
+ */
+ }
+
+ /* Set things up for this new attribute */
+ attno_skip++;
+ attno_inputsk = inputsk->sk_attno;
+ attno_has_equal = false;
+ }
+
+ /*
+ * Track if this scan key's attribute has any equality strategy scan
+ * keys.
+ *
+ * Treat IS NULL scan keys as using equal strategy (they'll be marked
+ * as using it later on, by _bt_fix_scankey_strategy).
+ */
+ if (inputsk->sk_strategy == BTEqualStrategyNumber ||
+ (inputsk->sk_flags & SK_SEARCHNULL))
+ attno_has_equal = true;
+
+ /*
+ * We don't support RowCompare transformation. Remember that we saw a
+ * RowCompare, so that we don't keep adding skip attributes.
+ *
+ * We do still backfill skip attributes before the RowCompare, so that
+ * it can be marked required. This is similar to what happens when a
+ * conventional inequality uses an opclass that lacks skip support.
+ */
+ if (inputsk->sk_flags & SK_ROW_HEADER)
+ attno_has_rowcompare = true;
+ }
+
+ return numSkipArrayKeys;
+}
+
+/*
+ * _bt_skip_support() -- set up skip support function in *skipatts
+ *
+ * Returns true on success, indicating that we set *skipatts with input
+ * opclass's skip support routine for caller. Otherwise returns false.
+ */
+static bool
+_bt_skip_support(Relation rel, int add_skip_attno, BTSkipPreproc *skipatts)
+{
+ int16 *indoption = rel->rd_indoption;
+ Oid opfamily = rel->rd_opfamily[add_skip_attno - 1];
+ Oid opcintype = rel->rd_opcintype[add_skip_attno - 1];
+ bool reverse;
+
+ /* Look up input opclass's equality operator */
+ skipatts->eq_op = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ /*
+ * We don't expect input opclasses lacking even an equality operator, but
+ * it's possible. Deal with it gracefully.
+ */
+ if (!OidIsValid(skipatts->eq_op))
+ return false;
+
+ /* Have skip support infrastructure set all SkipSupport fields */
+ reverse = (indoption[add_skip_attno - 1] & INDOPTION_DESC) != 0;
+ return PrepareSkipSupportFromOpclass(opfamily, opcintype, reverse,
+ &skipatts->sksup);
+}
+
+/*
+ * _bt_apply_decrement() -- Get a decremented copy of skey's arg
+ *
+ * Note: this wrapper function calls the opclass increment function when the
+ * index stores values in descending order. We're "logically decrementing" to
+ * the previous value in the key space regardless.
+ */
+static inline Datum
+_bt_apply_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ if (!(skey->sk_flags & SK_BT_DESC))
+ return array->sksup.decrement(rel, skey->sk_argument);
+ else
+ return array->sksup.increment(rel, skey->sk_argument);
+}
+
+/*
+ * _bt_apply_increment() -- Get an incremented copy of skey's arg
+ *
+ * Note: this wrapper function calls the opclass decrement function when the
+ * index stores values in descending order. We're "logically incrementing" to
+ * the next value in the key space regardless.
+ */
+static inline Datum
+_bt_apply_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ if (!(skey->sk_flags & SK_BT_DESC))
+ return array->sksup.increment(rel, skey->sk_argument);
+ else
+ return array->sksup.decrement(rel, skey->sk_argument);
+}
+
/*
* _bt_setup_array_cmp() -- Set up array comparison functions
*
@@ -979,15 +1384,10 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
{
Relation rel = scan->indexRelation;
Oid opcintype = rel->rd_opcintype[arraysk->sk_attno - 1];
- int cmpresult = 0,
- cmpexact = 0,
- matchelem,
- new_nelems = 0;
FmgrInfo crosstypeproc;
FmgrInfo *orderprocp = orderproc;
Assert(arraysk->sk_attno == skey->sk_attno);
- Assert(array->num_elems > 0);
Assert(!(arraysk->sk_flags & (SK_ISNULL | SK_ROW_HEADER | SK_ROW_MEMBER)));
Assert((arraysk->sk_flags & SK_SEARCHARRAY) &&
arraysk->sk_strategy == BTEqualStrategyNumber);
@@ -1000,8 +1400,8 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
* datum of opclass input type for the index's attribute (on-disk type).
* We can reuse the array's ORDER proc whenever the non-array scan key's
* type is a match for the corresponding attribute's input opclass type.
- * Otherwise, we have to do another ORDER proc lookup so that our call to
- * _bt_binsrch_array_skey applies the correct comparator.
+ * Otherwise, we have to do another ORDER proc lookup. We have to be sure
+ * that _bt_compare_array_skey/_bt_binsrch_array_skey use the right proc.
*
* Note: we have to support the convention that sk_subtype == InvalidOid
* means the opclass input type; this is a hack to simplify life for
@@ -1032,11 +1432,46 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
return false;
}
- /* We have all we need to determine redundancy/contradictoriness */
orderprocp = &crosstypeproc;
fmgr_info(cmp_proc, orderprocp);
}
+ /*
+ * We have all we need to determine redundancy/contradictoriness.
+ *
+ * Perform preprocessing of the array based on whether it's a conventional
+ * array, or a skip array. Sets *qual_ok correctly in passing.
+ */
+ if (array->num_elems != -1)
+ _bt_apply_compare_array(arraysk, skey,
+ orderprocp, array, qual_ok);
+ else
+ _bt_apply_compare_skiparray(scan, arraysk, skey,
+ orderproc, orderprocp,
+ array, qual_ok);
+
+ return true;
+}
+
+/*
+ * Finish off preprocessing of conventional (non-skip) array scan key when it
+ * is redundant with (or contradicted by) a non-array scalar scan key.
+ *
+ * _bt_compare_array_scankey_args helper function, called after the relevant
+ * (potentially cross-type) ORDER proc has been looked up successfully.
+ */
+static void
+_bt_apply_compare_array(ScanKey arraysk, ScanKey skey, FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok)
+{
+ int cmpresult = 0,
+ cmpexact = 0,
+ matchelem,
+ new_nelems = 0;
+
+ Assert(array->num_elems > 0);
+ Assert(!(arraysk->sk_flags & SK_BT_SKIP));
+
matchelem = _bt_binsrch_array_skey(orderprocp, false,
NoMovementScanDirection,
skey->sk_argument, false, array,
@@ -1088,8 +1523,152 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
array->num_elems = new_nelems;
*qual_ok = new_nelems > 0;
+}
- return true;
+/*
+ * Finish off preprocessing of skip array scan key when it is redundant with
+ * (or contradicted by) a non-array scalar scan key.
+ *
+ * _bt_compare_array_scankey_args helper function, called after the relevant
+ * (potentially cross-type) ORDER proc has been looked up successfully.
+ *
+ * Arrays used to skip (skip scan/missing key attribute predicates) work by
+ * procedurally generating their elements on the fly. We must still
+ * "eliminate contradictory elements", but it works a little differently: we
+ * narrow the range of the skip array, such that the array will never
+ * generate contradicted-by-skey elements.
+ *
+ * FIXME Our behavior in scenarios with cross-type operators (range skip scan
+ * cases) is buggy. We're naively copying datums of a different type from
+ * scalar inequality scan keys into the array's low_value and high_value
+ * fields. This tends not to break visibly in practice (types that appear
+ * within the same operator family tend to have compatible datum
+ * representations, at least on systems with little-endian byte order). Put
+ * off dealing with the problem until a later revision of the patch.
+ *
+ * It seems likely that the best way to fix this problem will involve keeping
+ * around the original operator in the BTArrayKeyInfo array struct whenever
+ * we're passed a "redundant" cross-type inequality operator (an approach
+ * involving casts/coercions might be tempting, but seems much too fragile).
+ * We only need to use not-column-input-opclass-type operators for the first
+ * and/or last array elements from the skip array under this scheme; we'll
+ * still mostly be dealing with opcintype-typed datums, copied from the index
+ * (as well as incrementing/decrementing copies of those index tuple datums).
+ * Importantly, this scheme should work just as well with an opfamily that
+ * doesn't even have an orderprocp cross-type ORDER operator to pass us here
+ * (we might even have to keep more than one same-strategy inequality, since
+ * in general _bt_preprocess_keys might not be able to prove which inequality
+ * is redundant).
+ */
+static void
+_bt_apply_compare_skiparray(IndexScanDesc scan, ScanKey arraysk, ScanKey skey,
+ FmgrInfo *orderproc, FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok)
+{
+ Relation rel = scan->indexRelation;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Form_pg_attribute attr = TupleDescAttr(RelationGetDescr(rel),
+ skey->sk_attno - 1);
+ MemoryContext oldContext;
+ int cmpresult;
+
+ /*
+ * We don't expect to have to deal with NULLs in the non-array/non-skip scan
+ * key. We expect _bt_preprocess_array_keys to avoid generating a skip
+ * array for an index attribute with an IS NULL input scan key. It will
+ * still do so in the presence of IS NOT NULL input scan keys, but
+ * _bt_compare_scankey_args is expected to handle those for us.
+ */
+ Assert(arraysk->sk_flags & SK_BT_SKIP);
+ Assert(arraysk->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & SK_ISNULL));
+ Assert(array->num_elems == -1);
+
+ /*
+ * Scalar scan key must be a B-Tree operator, which must always be strict.
+ * Array shouldn't generate a NULL "array element"/an IS NULL qual. This
+ * isn't just an optimization; it's strictly necessary for correctness.
+ */
+ array->null_elem = false;
+
+ switch (skey->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+
+ /*
+ * detect if scan key argument will be < low_value once
+ * decremented
+ */
+ cmpresult = _bt_compare_array_skey(orderprocp,
+ skey->sk_argument, false,
+ array->sksup.low_elem, false,
+ arraysk);
+ if (cmpresult <= 0)
+ {
+ /* decrementing would make qual unsatisfiable, so don't try */
+ *qual_ok = false;
+ return;
+ }
+
+ /* decremented scan key value becomes skip array's new high_value */
+ oldContext = MemoryContextSwitchTo(so->arrayContext);
+ array->sksup.high_elem = _bt_apply_decrement(rel, skey, array);
+ MemoryContextSwitchTo(oldContext);
+ break;
+ case BTLessEqualStrategyNumber:
+ oldContext = MemoryContextSwitchTo(so->arrayContext);
+ array->sksup.high_elem = datumCopy(skey->sk_argument,
+ attr->attbyval, attr->attlen);
+ MemoryContextSwitchTo(oldContext);
+ break;
+ case BTEqualStrategyNumber:
+ /* _bt_preprocess_array_keys should have avoided this */
+ elog(ERROR, "equality strategy scan key conflicts with skip key for attribute %d on index \"%s\"",
+ skey->sk_attno, RelationGetRelationName(rel));
+ break;
+ case BTGreaterEqualStrategyNumber:
+ oldContext = MemoryContextSwitchTo(so->arrayContext);
+ array->sksup.low_elem = datumCopy(skey->sk_argument,
+ attr->attbyval, attr->attlen);
+ MemoryContextSwitchTo(oldContext);
+ break;
+ case BTGreaterStrategyNumber:
+
+ /*
+ * detect if scan key argument will be > high_value once
+ * incremented
+ */
+ cmpresult = _bt_compare_array_skey(orderprocp,
+ skey->sk_argument, false,
+ array->sksup.high_elem, false,
+ arraysk);
+ if (cmpresult >= 0)
+ {
+ /* incrementing would make qual unsatisfiable, so don't try */
+ *qual_ok = false;
+ return;
+ }
+
+ /* incremented scan key value becomes skip array's new low_value */
+ oldContext = MemoryContextSwitchTo(so->arrayContext);
+ array->sksup.low_elem = _bt_apply_increment(rel, skey, array);
+ MemoryContextSwitchTo(oldContext);
+ break;
+ default:
+ elog(ERROR, "unrecognized StrategyNumber: %d",
+ (int) skey->sk_strategy);
+ break;
+ }
+
+ /*
+ * Is the qual contradictory, or is it merely "redundant" with consed-up
+ * skip array?
+ */
+ cmpresult = _bt_compare_array_skey(orderproc, /* don't use orderprocp */
+ array->sksup.low_elem, false,
+ array->sksup.high_elem, false,
+ arraysk);
+ *qual_ok = (cmpresult <= 0);
}
/*
@@ -1130,7 +1709,8 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
static inline int32
_bt_compare_array_skey(FmgrInfo *orderproc,
Datum tupdatum, bool tupnull,
- Datum arrdatum, ScanKey cur)
+ Datum arrdatum, bool arrnull,
+ ScanKey cur)
{
int32 result = 0;
@@ -1138,14 +1718,14 @@ _bt_compare_array_skey(FmgrInfo *orderproc,
if (tupnull) /* NULL tupdatum */
{
- if (cur->sk_flags & SK_ISNULL)
+ if (arrnull)
result = 0; /* NULL "=" NULL */
else if (cur->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (cur->sk_flags & SK_ISNULL) /* NOT_NULL tupdatum, NULL arrdatum */
+ else if (arrnull) /* NOT_NULL tupdatum, NULL arrdatum */
{
if (cur->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
@@ -1211,6 +1791,8 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
Datum arrdatum;
Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(!(cur->sk_flags & SK_BT_SKIP));
+ Assert(!(cur->sk_flags & SK_ISNULL)); /* plain arrays can't do this */
Assert(cur->sk_strategy == BTEqualStrategyNumber);
if (cur_elem_trig)
@@ -1246,7 +1828,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
{
arrdatum = array->elem_values[low_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result <= 0)
{
@@ -1274,7 +1856,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
{
arrdatum = array->elem_values[high_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result >= 0)
{
@@ -1301,7 +1883,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
arrdatum = array->elem_values[mid_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result == 0)
{
@@ -1326,13 +1908,123 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
*/
if (low_elem != mid_elem)
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- array->elem_values[low_elem], cur);
+ array->elem_values[low_elem], false,
+ cur);
*set_elem_result = result;
return low_elem;
}
+/*
+ * _bt_binsrch_skiparray_skey() -- "Binary search" within a skip array
+ *
+ * Skip scan arrays procedurally generate their elements on-demand. They
+ * largely function in the same way as standard arrays. They can be rolled
+ * over by standard arrays (standard array can also roll over skip arrays).
+ *
+ * This routine doesn't return an index into the array, because the array
+ * doesn't actually have any elements (it has low_value and high_value, which
+ * indicate the range of values that the array can generate). Note that this
+ * may include a NULL value/an IS NULL qual (unlike with true arrays).
+ *
+ * Sets *set_elem_result just like _bt_binsrch_array_skey would with a true
+ * array. The value 0 indicates that tupdatum/tupnull is within the range of
+ * the skip array. Other values indicate what _bt_compare_array_skey returned
+ * for the best available match to tupdatum/tupnull (in practice this means
+ * either the lowest item or the highest item in the range of the array).
+ *
+ * cur_elem_trig indicates if array advancement was triggered by this skip
+ * array's scan key. We can apply this information to find the next matching
+ * array element in the current scan direction using fewer comparisons.
+ */
+static void
+_bt_binsrch_skiparray_skey(FmgrInfo *orderproc,
+ bool cur_elem_trig, ScanDirection dir,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *set_elem_result)
+{
+ Datum arrdatum;
+ bool arrnull;
+
+ Assert(!ScanDirectionIsNoMovement(dir));
+ Assert(cur->sk_flags & SK_BT_SKIP);
+ Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(cur->sk_flags & SK_BT_REQFWD);
+ Assert(array->num_elems == -1);
+
+ /*
+ * Compare tupdatum against "first array element" in the current scan
+ * direction first (and allow NULL to be treated as a possible element).
+ *
+ * Optimization: don't have to bother with this when passed a skip array
+ * that is known to have triggered array advancement.
+ */
+ if (!cur_elem_trig)
+ {
+ if (ScanDirectionIsForward(dir))
+ {
+ arrdatum = array->sksup.low_elem;
+ arrnull = array->null_elem && (cur->sk_flags & SK_BT_NULLS_FIRST);
+ }
+ else
+ {
+ arrdatum = array->sksup.high_elem;
+ arrnull = array->null_elem && !(cur->sk_flags & SK_BT_NULLS_FIRST);
+ }
+
+ *set_elem_result = _bt_compare_array_skey(orderproc,
+ tupdatum, tupnull,
+ arrdatum, arrnull,
+ cur);
+
+ /*
+ * Optimization: return early when >= lower bound happens to be an
+ * exact match (or when <= upper bound is an exact match during a
+ * backwards scan)
+ */
+ if (*set_elem_result == 0)
+ return;
+
+ /* Is tupdatum "before the start" of our lowest "element"? */
+ if ((ScanDirectionIsForward(dir) && *set_elem_result < 0) ||
+ (ScanDirectionIsBackward(dir) && *set_elem_result > 0))
+ return;
+ }
+
+ /*
+ * Now compare tupdatum to the "last array element" in the current scan
+ * direction (and allow NULL to be treated as a possible element)
+ */
+ if (ScanDirectionIsForward(dir))
+ {
+ arrdatum = array->sksup.high_elem;
+ arrnull = array->null_elem && !(cur->sk_flags & SK_BT_NULLS_FIRST);
+ }
+ else
+ {
+ arrdatum = array->sksup.low_elem;
+ arrnull = array->null_elem && (cur->sk_flags & SK_BT_NULLS_FIRST);
+ }
+
+ *set_elem_result = _bt_compare_array_skey(orderproc,
+ tupdatum, tupnull,
+ arrdatum, arrnull,
+ cur);
+
+ /* Is tupdatum "after the end" of our highest "element"? */
+ if ((ScanDirectionIsForward(dir) && *set_elem_result > 0) ||
+ (ScanDirectionIsBackward(dir) && *set_elem_result < 0))
+ return;
+
+ /*
+ * tupdatum must be within the range of the skip array. Have our caller
+ * treat tupdatum as one of the array's elements.
+ */
+ *set_elem_result = 0;
+}
+
/*
* _bt_start_array_keys() -- Initialize array keys at start of a scan
*
@@ -1342,29 +2034,257 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
void
_bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- int i;
Assert(so->numArrayKeys);
Assert(so->qual_ok);
- for (i = 0; i < so->numArrayKeys; i++)
+ for (int i = 0; i < so->numArrayKeys; i++)
{
BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
ScanKey skey = &so->keyData[curArrayKey->scan_key];
- Assert(curArrayKey->num_elems > 0);
Assert(skey->sk_flags & SK_SEARCHARRAY);
- if (ScanDirectionIsBackward(dir))
- curArrayKey->cur_elem = curArrayKey->num_elems - 1;
- else
- curArrayKey->cur_elem = 0;
- skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
+ _bt_scankey_set_low_or_high(rel, skey, curArrayKey,
+ ScanDirectionIsForward(dir));
}
so->scanBehind = false;
}
+/*
+ * _bt_scankey_decrement() -- decrement scan key's sk_argument
+ *
+ * Unsets scan key "IS NULL" flags when required, and handles memory
+ * management for pass-by-reference types.
+ */
+static void
+_bt_scankey_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+
+ if (skey->sk_flags & SK_ISNULL)
+ _bt_scankey_unset_isnull(rel, skey, array);
+ else
+ {
+ Datum dec_sk_argument;
+ Form_pg_attribute attr;
+
+ /* Get a decremented copy of existing sk_argument */
+ dec_sk_argument = _bt_apply_decrement(rel, skey, array);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Set decremented copy of original sk_argument in scan key */
+ skey->sk_argument = dec_sk_argument;
+ }
+}
+
+/*
+ * _bt_scankey_increment() -- increment scan key's sk_argument
+ *
+ * Unsets scan key "IS NULL" flags when required, and handles memory
+ * management for pass-by-reference types.
+ */
+static void
+_bt_scankey_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+
+ if (skey->sk_flags & SK_ISNULL)
+ _bt_scankey_unset_isnull(rel, skey, array);
+ else
+ {
+ Datum inc_sk_argument;
+ Form_pg_attribute attr;
+
+ /* Get an incremented copy of existing sk_argument */
+ inc_sk_argument = _bt_apply_increment(rel, skey, array);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Set incremented copy of original sk_argument in scan key */
+ skey->sk_argument = inc_sk_argument;
+ }
+}
+
+/*
+ * _bt_scankey_set_low_or_high() -- Set array scan key to lowest/highest element
+ *
+ * Caller also passes associated scan key, which will have its argument set to
+ * the lowest/highest array value in passing.
+ */
+static void
+_bt_scankey_set_low_or_high(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ bool low_not_high)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+
+ if (array->num_elems != -1)
+ {
+ /* set low or high element for conventional array */
+ int set_elem = 0;
+
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+
+ if (!low_not_high)
+ set_elem = array->num_elems - 1;
+
+ /*
+ * Just copy over array datum (only skip arrays require freeing and
+ * allocating memory for sk_argument)
+ */
+ array->cur_elem = set_elem;
+ skey->sk_argument = array->elem_values[set_elem];
+
+ return;
+ }
+
+ /* set low or high element for skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(array->num_elems == -1);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ if (array->null_elem &&
+ (low_not_high == ((skey->sk_flags & SK_BT_NULLS_FIRST) != 0)))
+ {
+ /* Set element to NULL (lowest/highest element) */
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+ }
+ else
+ {
+ /* Lowest/highest array element isn't NULL */
+ if (low_not_high)
+ skey->sk_argument = datumCopy(array->sksup.low_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_argument = datumCopy(array->sksup.high_elem,
+ attr->attbyval, attr->attlen);
+
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL);
+ }
+}
+
+/*
+ * _bt_scankey_set_element() -- Set skip array scan key's sk_argument
+ *
+ * Sets scan key to "IS NULL" when required, and handles memory management for
+ * pass-by-reference types.
+ */
+static void
+_bt_scankey_set_element(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ Datum tupdatum, bool tupnull)
+{
+ /* tupdatum within the range of low_elem/high_elem */
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /*
+ * Treat tupdatum/tupnull as a matching array element.
+ *
+ * We just copy tupdatum into the array's scan key (there is no
+ * conventional array element for us to set, of course).
+ */
+ if (tupnull)
+ {
+ /*
+ * Unlike standard arrays, skip arrays sometimes need to locate NULLs.
+ * We can treat them as just another value from the domain of indexed
+ * values.
+ */
+ Assert(array->null_elem);
+ Assert(!(skey->sk_flags & (SK_SEARCHNULL | SK_ISNULL)));
+
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+ }
+ else
+ {
+ skey->sk_argument = datumCopy(tupdatum,
+ attr->attbyval, attr->attlen);
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL);
+ }
+}
+
+/*
+ * _bt_scankey_unset_isnull() -- increment/decrement scan key from NULL
+ *
+ * Unsets scan key's "IS NULL" marking, and sets the non-NULL value from the
+ * array immediately before (or immediately after) NULL in the key space.
+ */
+static void
+_bt_scankey_unset_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(skey->sk_flags & SK_ISNULL);
+ Assert(array->null_elem);
+
+ /*
+ * sk_argument must be set to whatever non-NULL value comes immediately
+ * before or after NULL
+ */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL);
+ if (skey->sk_flags & SK_BT_NULLS_FIRST)
+ skey->sk_argument = datumCopy(array->sksup.low_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_argument = datumCopy(array->sksup.high_elem,
+ attr->attbyval, attr->attlen);
+}
+
+/*
+ * _bt_scankey_set_isnull() -- increment/decrement scan key to NULL
+ *
+ * Sets scan key to "IS NULL", and handles memory management for
+ * pass-by-reference types.
+ */
+static void
+_bt_scankey_set_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & SK_ISNULL));
+ Assert(array->null_elem);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Set sk_argument to NULL */
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+}
+
/*
* _bt_advance_array_keys_increment() -- Advance to next set of array elements
*
@@ -1380,6 +2300,7 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
static bool
_bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
/*
@@ -1391,10 +2312,24 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
{
BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
ScanKey skey = &so->keyData[curArrayKey->scan_key];
+ FmgrInfo *orderproc = &so->orderProcs[curArrayKey->scan_key];
int cur_elem = curArrayKey->cur_elem;
int num_elems = curArrayKey->num_elems;
bool rolled = false;
+ /* Handle incrementing a skip array */
+ if (num_elems == -1)
+ {
+ /* Attempt to incrementally advance this skip scan array */
+ if (_bt_advance_skip_array_key_increment(rel, dir, curArrayKey,
+ skey, orderproc))
+ return true;
+
+ /* Array rolled over. Need to advance next array key, if any. */
+ continue;
+ }
+
+ /* Handle incrementing a true array */
if (ScanDirectionIsForward(dir) && ++cur_elem >= num_elems)
{
cur_elem = 0;
@@ -1411,7 +2346,7 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
if (!rolled)
return true;
- /* Need to advance next array key, if any */
+ /* Array rolled over. Need to advance next array key, if any. */
}
/*
@@ -1429,6 +2364,95 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
return false;
}
+/*
+ * _bt_advance_skip_array_key_increment() -- increment a skip scan array
+ *
+ * Returns true when the skip array was successfully incremented to the next
+ * value in the current scan direction, dir. Otherwise handles roll over by
+ * setting array to its first element for the current scan direction.
+ */
+static bool
+_bt_advance_skip_array_key_increment(Relation rel, ScanDirection dir,
+ BTArrayKeyInfo *array, ScanKey skey,
+ FmgrInfo *orderproc)
+{
+ Datum sk_argument = skey->sk_argument;
+ bool sk_isnull = (skey->sk_flags & SK_ISNULL) != 0;
+ int compare;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(array->num_elems == -1);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* high_elem is final non-NULL element in current scan direction */
+ compare = _bt_compare_array_skey(orderproc,
+ array->sksup.high_elem, false,
+ sk_argument, sk_isnull,
+ skey);
+ if (compare > 0)
+ {
+ /* Increment non-NULL element to next non-NULL element */
+ _bt_scankey_increment(rel, skey, array);
+
+ return true;
+ }
+ else if (compare == 0 && array->null_elem &&
+ !(skey->sk_flags & SK_BT_NULLS_FIRST))
+ {
+ /*
+ * Existing sk_argument was already equal to high_elem. Increment
+ * from high_elem to final NULL element (without calling opclass
+ * support function, which doesn't know how to handle NULLs).
+ */
+ _bt_scankey_set_isnull(rel, skey, array);
+
+ return true;
+ }
+
+ /* Exhausted all array elements in current scan direction */
+ }
+ else
+ {
+ /* low_elem is final non-NULL element in current scan direction */
+ compare = _bt_compare_array_skey(orderproc,
+ array->sksup.low_elem, false,
+ sk_argument, sk_isnull,
+ skey);
+ if (compare < 0)
+ {
+ /* Decrement non-NULL element to previous non-NULL element */
+ _bt_scankey_decrement(rel, skey, array);
+
+ return true;
+ }
+ else if (compare == 0 && array->null_elem &&
+ (skey->sk_flags & SK_BT_NULLS_FIRST))
+ {
+ /*
+ * Existing sk_argument was already equal to low_elem. Decrement
+ * from low_elem to final NULL element (without calling opclass
+ * support function, which doesn't know how to handle NULLs).
+ */
+ _bt_scankey_set_isnull(rel, skey, array);
+
+ return true;
+ }
+
+ /* Exhausted all array elements in current scan direction */
+ }
+
+ /*
+ * Skip array rolls over. Start over at the array's lowest sorting value
+ * (or its highest value, for backward scans).
+ */
+ _bt_scankey_set_low_or_high(rel, skey, array, ScanDirectionIsForward(dir));
+
+ /* Caller must consider earlier/more significant arrays in turn */
+ return false;
+}
+
/*
* _bt_rewind_nonrequired_arrays() -- Rewind non-required arrays
*
@@ -1466,6 +2490,7 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
static void
_bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
int arrayidx = 0;
@@ -1473,7 +2498,6 @@ _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
{
ScanKey cur = so->keyData + ikey;
BTArrayKeyInfo *array = NULL;
- int first_elem_dir;
if (!(cur->sk_flags & SK_SEARCHARRAY) ||
cur->sk_strategy != BTEqualStrategyNumber)
@@ -1485,16 +2509,10 @@ _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
if ((cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
continue;
- if (ScanDirectionIsForward(dir))
- first_elem_dir = 0;
- else
- first_elem_dir = array->num_elems - 1;
+ Assert(array->num_elems != -1); /* No skipping of non-required arrays */
- if (array->cur_elem != first_elem_dir)
- {
- array->cur_elem = first_elem_dir;
- cur->sk_argument = array->elem_values[first_elem_dir];
- }
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
}
}
@@ -1558,6 +2576,8 @@ _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
for (int ikey = sktrig; ikey < so->numberOfKeys; ikey++)
{
ScanKey cur = so->keyData + ikey;
+ Datum sk_argument = cur->sk_argument;
+ bool sk_isnull = (cur->sk_flags & SK_ISNULL) != 0;
Datum tupdatum;
bool tupnull;
int32 result;
@@ -1621,7 +2641,8 @@ _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
result = _bt_compare_array_skey(&so->orderProcs[ikey],
tupdatum, tupnull,
- cur->sk_argument, cur);
+ sk_argument, sk_isnull,
+ cur);
/*
* Does this comparison indicate that caller must _not_ advance the
@@ -1954,18 +2975,9 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
if (beyond_end_advance)
{
- int final_elem_dir;
-
- if (ScanDirectionIsBackward(dir) || !array)
- final_elem_dir = 0;
- else
- final_elem_dir = array->num_elems - 1;
-
- if (array && array->cur_elem != final_elem_dir)
- {
- array->cur_elem = final_elem_dir;
- cur->sk_argument = array->elem_values[final_elem_dir];
- }
+ if (array)
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsBackward(dir));
continue;
}
@@ -1990,18 +3002,9 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
if (!all_required_satisfied || cur->sk_attno > tupnatts)
{
- int first_elem_dir;
-
- if (ScanDirectionIsForward(dir) || !array)
- first_elem_dir = 0;
- else
- first_elem_dir = array->num_elems - 1;
-
- if (array && array->cur_elem != first_elem_dir)
- {
- array->cur_elem = first_elem_dir;
- cur->sk_argument = array->elem_values[first_elem_dir];
- }
+ if (array)
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
continue;
}
@@ -2019,15 +3022,27 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
/*
* Binary search for closest match that's available from the array
*/
- set_elem = _bt_binsrch_array_skey(&so->orderProcs[ikey],
- cur_elem_trig, dir,
- tupdatum, tupnull, array, cur,
- &result);
+ if (array->num_elems != -1)
+ set_elem = _bt_binsrch_array_skey(&so->orderProcs[ikey],
+ cur_elem_trig, dir,
+ tupdatum, tupnull, array, cur,
+ &result);
- Assert(set_elem >= 0 && set_elem < array->num_elems);
+ /*
+ * Skip array. "Binary search" by checking if tupdatum/tupnull
+ * are within the low_elem/high_elem range of the skip array.
+ */
+ else
+ _bt_binsrch_skiparray_skey(&so->orderProcs[ikey],
+ cur_elem_trig, dir,
+ tupdatum, tupnull, array, cur,
+ &result);
}
else
{
+ Datum sk_argument = cur->sk_argument;
+ bool sk_isnull = (cur->sk_flags & SK_ISNULL) != 0;
+
Assert(sktrig_required && required);
/*
@@ -2041,7 +3056,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
result = _bt_compare_array_skey(&so->orderProcs[ikey],
tupdatum, tupnull,
- cur->sk_argument, cur);
+ sk_argument, sk_isnull, cur);
}
/*
@@ -2061,6 +3076,10 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
* its final element is. Once outside the loop we'll then "increment
* this array's set_elem" by calling _bt_advance_array_keys_increment.
* That way the process rolls over to higher order arrays as needed.
+ * The skip array case will set the array's scan key to the final
+ * valid element for the current scan direction, which is equivalent
+ * (when we have a real set_elem "match" it's just the final element
+ * in the current scan direction).
*
* Under this scheme any required arrays only ever ratchet forwards
* (or backwards), and always do so to the maximum possible extent
@@ -2100,11 +3119,62 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
}
}
- /* Advance array keys, even when set_elem isn't an exact match */
- if (array && array->cur_elem != set_elem)
+ /* Advance array keys, even when we don't have an exact match */
+
+ if (!array)
+ continue; /* no element to set in non-array */
+
+ /* Conventional arrays have a valid set_elem for us to advance to */
+ if (array->num_elems != -1)
{
- array->cur_elem = set_elem;
- cur->sk_argument = array->elem_values[set_elem];
+ if (array->cur_elem != set_elem)
+ {
+ array->cur_elem = set_elem;
+ cur->sk_argument = array->elem_values[set_elem];
+ }
+
+ continue;
+ }
+
+ /*
+ * Conceptually, skip arrays also have array elements. The actual
+ * elements/values are generated procedurally and on demand.
+ */
+ Assert(cur->sk_flags & SK_BT_SKIP);
+ Assert(array->num_elems == -1);
+ Assert(required);
+
+ if (result == 0)
+ {
+ /*
+ * Anything within the range of possible element values is treated
+ * as "a match for one of the array's elements". Store the next
+ * scan key argument value by taking a copy of the tupdatum value
+ * from caller's tuple (or set scan key IS NULL when tupnull, iff
+ * the array's range of possible elements covers NULL).
+ */
+ _bt_scankey_set_element(rel, cur, array, tupdatum, tupnull);
+ }
+ else if (beyond_end_advance)
+ {
+ /*
+ * We need to set the array element to the final "element" in the
+ * current scan direction for "beyond end of array element" array
+ * advancement. See above for an explanation.
+ */
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsBackward(dir));
+ }
+ else
+ {
+ /*
+ * The closest matching element is the lowest element; even that
+ * still puts us ahead of caller's tuple in the key space. This
+ * process has to carry to any lower-order arrays. See above for
+ * an explanation.
+ */
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
}
}
@@ -2460,10 +3530,12 @@ end_toplevel_scan:
/*
* _bt_preprocess_keys() -- Preprocess scan keys
*
+ * The first call here (per btrescan) allocates so->keyData[].
* The given search-type keys (taken from scan->keyData[])
* are copied to so->keyData[] with possible transformation.
* scan->numberOfKeys is the number of input keys, so->numberOfKeys gets
- * the number of output keys (possibly less, never greater).
+ * the number of output keys. Calling here a second time (during the same
+ * btrescan) is a no-op.
*
* The output keys are marked with additional sk_flags bits beyond the
* system-standard bits supplied by the caller. The DESC and NULLS_FIRST
@@ -2483,6 +3555,8 @@ end_toplevel_scan:
* within each attribute may be done as a byproduct of the processing here.
* That process must leave array scan keys (within an attribute) in the same
* order as corresponding entries from the scan's BTArrayKeyInfo array info.
+ * We might also cons up skip array scan keys that weren't present in the
+ * original input keys; these are also output in standard attribute order.
*
* The output keys are marked with flags SK_BT_REQFWD and/or SK_BT_REQBKWD
* if they must be satisfied in order to continue the scan forward or backward
@@ -2550,9 +3624,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
int16 *indoption = scan->indexRelation->rd_indoption;
int new_numberOfKeys;
int numberOfEqualCols;
- ScanKey inkeys;
- ScanKey outkeys;
- ScanKey cur;
+ ScanKey inputsk;
BTScanKeyPreproc xform[BTMaxStrategyNumber];
bool test_result;
int i,
@@ -2584,7 +3656,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
return; /* done if qual-less scan */
/* If any keys are SK_SEARCHARRAY type, set up array-key info */
- arrayKeyData = _bt_preprocess_array_keys(scan);
+ arrayKeyData = _bt_preprocess_array_keys(scan, &numberOfKeys);
if (!so->qual_ok)
{
/* unmatchable array, so give up */
@@ -2598,32 +3670,36 @@ _bt_preprocess_keys(IndexScanDesc scan)
*/
if (arrayKeyData)
{
- inkeys = arrayKeyData;
+ inputsk = arrayKeyData;
/* Also maintain keyDataMap for remapping so->orderProc[] later */
keyDataMap = MemoryContextAlloc(so->arrayContext,
numberOfKeys * sizeof(int));
}
else
- inkeys = scan->keyData;
+ inputsk = scan->keyData;
+
+ /*
+ * Now that we have an estimate of the number of output scan keys
+ * (including any skip array scan keys), allocate space for them
+ */
+ so->keyData = palloc(sizeof(ScanKeyData) * numberOfKeys);
- outkeys = so->keyData;
- cur = &inkeys[0];
/* we check that input keys are correctly ordered */
- if (cur->sk_attno < 1)
+ if (inputsk->sk_attno < 1)
elog(ERROR, "btree index keys must be ordered by attribute");
/* We can short-circuit most of the work if there's just one key */
if (numberOfKeys == 1)
{
/* Apply indoption to scankey (might change sk_strategy!) */
- if (!_bt_fix_scankey_strategy(cur, indoption))
+ if (!_bt_fix_scankey_strategy(inputsk, indoption))
so->qual_ok = false;
- memcpy(outkeys, cur, sizeof(ScanKeyData));
+ memcpy(so->keyData, inputsk, sizeof(ScanKeyData));
so->numberOfKeys = 1;
/* We can mark the qual as required if it's for first index col */
- if (cur->sk_attno == 1)
- _bt_mark_scankey_required(outkeys);
+ if (inputsk->sk_attno == 1)
+ _bt_mark_scankey_required(so->keyData);
if (arrayKeyData)
{
/*
@@ -2631,8 +3707,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
* (we'll miss out on the single value array transformation, but
* that's not nearly as important when there's only one scan key)
*/
- Assert(cur->sk_flags & SK_SEARCHARRAY);
- Assert(cur->sk_strategy != BTEqualStrategyNumber ||
+ Assert(so->keyData[0].sk_flags & SK_SEARCHARRAY);
+ Assert(so->keyData[0].sk_strategy != BTEqualStrategyNumber ||
(so->arrayKeys[0].scan_key == 0 &&
OidIsValid(so->orderProcs[0].fn_oid)));
}
@@ -2660,12 +3736,12 @@ _bt_preprocess_keys(IndexScanDesc scan)
* handle after-last-key processing. Actual exit from the loop is at the
* "break" statement below.
*/
- for (i = 0;; cur++, i++)
+ for (i = 0;; inputsk++, i++)
{
if (i < numberOfKeys)
{
/* Apply indoption to scankey (might change sk_strategy!) */
- if (!_bt_fix_scankey_strategy(cur, indoption))
+ if (!_bt_fix_scankey_strategy(inputsk, indoption))
{
/* NULL can't be matched, so give up */
so->qual_ok = false;
@@ -2677,12 +3753,12 @@ _bt_preprocess_keys(IndexScanDesc scan)
* If we are at the end of the keys for a particular attr, finish up
* processing and emit the cleaned-up keys.
*/
- if (i == numberOfKeys || cur->sk_attno != attno)
+ if (i == numberOfKeys || inputsk->sk_attno != attno)
{
int priorNumberOfEqualCols = numberOfEqualCols;
/* check input keys are correctly ordered */
- if (i < numberOfKeys && cur->sk_attno < attno)
+ if (i < numberOfKeys && inputsk->sk_attno < attno)
elog(ERROR, "btree index keys must be ordered by attribute");
/*
@@ -2741,7 +3817,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
return;
}
/* else discard the redundant non-equality key */
- Assert(!array || array->num_elems > 0);
+ Assert(!array || array->num_elems > 0 ||
+ array->num_elems == -1);
xform[j].skey = NULL;
xform[j].ikey = -1;
}
@@ -2786,7 +3863,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
}
/*
- * Emit the cleaned-up keys into the outkeys[] array, and then
+ * Emit the cleaned-up keys into the so->keyData[] array, and then
* mark them if they are required. They are required (possibly
* only in one direction) if all attrs before this one had "=".
*/
@@ -2794,7 +3871,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
{
if (xform[j].skey)
{
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
memcpy(outkey, xform[j].skey, sizeof(ScanKeyData));
if (arrayKeyData)
@@ -2811,19 +3888,19 @@ _bt_preprocess_keys(IndexScanDesc scan)
break;
/* Re-initialize for new attno */
- attno = cur->sk_attno;
+ attno = inputsk->sk_attno;
memset(xform, 0, sizeof(xform));
}
/* check strategy this key's operator corresponds to */
- j = cur->sk_strategy - 1;
+ j = inputsk->sk_strategy - 1;
/* if row comparison, push it directly to the output array */
- if (cur->sk_flags & SK_ROW_HEADER)
+ if (inputsk->sk_flags & SK_ROW_HEADER)
{
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
- memcpy(outkey, cur, sizeof(ScanKeyData));
+ memcpy(outkey, inputsk, sizeof(ScanKeyData));
if (arrayKeyData)
keyDataMap[new_numberOfKeys - 1] = i;
if (numberOfEqualCols == attno - 1)
@@ -2837,19 +3914,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
continue;
}
- /*
- * Does this input scan key require further processing as an array?
- */
- if (cur->sk_strategy == InvalidStrategy)
- {
- /* _bt_preprocess_array_keys marked this array key redundant */
- Assert(arrayKeyData);
- Assert(cur->sk_flags & SK_SEARCHARRAY);
- continue;
- }
-
- if (cur->sk_strategy == BTEqualStrategyNumber &&
- (cur->sk_flags & SK_SEARCHARRAY))
+ if (inputsk->sk_strategy == BTEqualStrategyNumber &&
+ (inputsk->sk_flags & SK_SEARCHARRAY))
{
/* _bt_preprocess_array_keys kept this array key */
Assert(arrayKeyData);
@@ -2863,7 +3929,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (xform[j].skey == NULL)
{
/* nope, so this scan key wins by default (at least for now) */
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -2881,7 +3947,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
/*
* Have to set up array keys
*/
- if ((cur->sk_flags & SK_SEARCHARRAY))
+ if ((inputsk->sk_flags & SK_SEARCHARRAY))
{
array = &so->arrayKeys[arrayidx - 1];
orderproc = so->orderProcs + i;
@@ -2909,13 +3975,15 @@ _bt_preprocess_keys(IndexScanDesc scan)
*/
}
- if (_bt_compare_scankey_args(scan, cur, cur, xform[j].skey,
- array, orderproc, &test_result))
+ if (_bt_compare_scankey_args(scan, inputsk, inputsk,
+ xform[j].skey, array, orderproc,
+ &test_result))
{
/* Have all we need to determine redundancy */
if (test_result)
{
- Assert(!array || array->num_elems > 0);
+ Assert(!array || array->num_elems > 0 ||
+ array->num_elems == -1);
/*
* New key is more restrictive, and so replaces old key...
@@ -2923,7 +3991,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (j != (BTEqualStrategyNumber - 1) ||
!(xform[j].skey->sk_flags & SK_SEARCHARRAY))
{
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -2936,7 +4004,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
* scan key. _bt_compare_scankey_args expects us to
* always keep arrays (and discard non-arrays).
*/
- Assert(!(cur->sk_flags & SK_SEARCHARRAY));
+ Assert(!(inputsk->sk_flags & SK_SEARCHARRAY));
}
}
else if (j == (BTEqualStrategyNumber - 1))
@@ -2959,14 +4027,14 @@ _bt_preprocess_keys(IndexScanDesc scan)
* even with incomplete opfamilies. _bt_advance_array_keys
* depends on this.
*/
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
memcpy(outkey, xform[j].skey, sizeof(ScanKeyData));
if (arrayKeyData)
keyDataMap[new_numberOfKeys - 1] = xform[j].ikey;
if (numberOfEqualCols == attno - 1)
_bt_mark_scankey_required(outkey);
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -3057,10 +4125,11 @@ _bt_verify_keys_with_arraykeys(IndexScanDesc scan)
if (array->scan_key != ikey)
return false;
- if (array->num_elems <= 0)
+ if (array->num_elems == 0 || array->num_elems < -1)
return false;
- if (cur->sk_argument != array->elem_values[array->cur_elem])
+ if (array->num_elems != -1 &&
+ cur->sk_argument != array->elem_values[array->cur_elem])
return false;
if (last_sk_attno > cur->sk_attno)
return false;
@@ -3135,6 +4204,22 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool leftnull,
rightnull;
+ /* Handle skip array comparison with IS NOT NULL scan key */
+ if ((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP)
+ {
+ /* Shouldn't generate skip array in presence of IS NULL key */
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_SEARCHNULL));
+ Assert((leftarg->sk_flags | rightarg->sk_flags) & SK_SEARCHNOTNULL);
+
+ /* Don't allow skip array to generate IS NULL scan key/element */
+ Assert(array->num_elems == -1);
+ array->null_elem = false;
+
+ /* IS NOT NULL key (could be leftarg or rightarg) now redundant */
+ *result = true;
+ return true;
+ }
+
if (leftarg->sk_flags & SK_ISNULL)
{
Assert(leftarg->sk_flags & (SK_SEARCHNULL | SK_SEARCHNOTNULL));
@@ -3208,6 +4293,7 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
{
/* Can't make the comparison */
*result = false; /* suppress compiler warnings */
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP));
return false;
}
@@ -3380,13 +4466,6 @@ _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption)
return true;
}
- if (skey->sk_strategy == InvalidStrategy)
- {
- /* Already-eliminated array scan key; don't need to fix anything */
- Assert(skey->sk_flags & SK_SEARCHARRAY);
- return true;
- }
-
/* Adjust strategy for DESC, if we didn't already */
if ((addflags & SK_BT_DESC) && !(skey->sk_flags & SK_BT_DESC))
skey->sk_strategy = BTCommuteStrategyNumber(skey->sk_strategy);
diff --git a/src/backend/access/nbtree/nbtvalidate.c b/src/backend/access/nbtree/nbtvalidate.c
index e9d4cd60d..96d0d9185 100644
--- a/src/backend/access/nbtree/nbtvalidate.c
+++ b/src/backend/access/nbtree/nbtvalidate.c
@@ -114,6 +114,10 @@ btvalidate(Oid opclassoid)
case BTOPTIONS_PROC:
ok = check_amoptsproc_signature(procform->amproc);
break;
+ case BTSKIPSUPPORT_PROC:
+ ok = check_amproc_signature(procform->amproc, VOIDOID, true,
+ 1, 1, INTERNALOID);
+ break;
default:
ereport(INFO,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
diff --git a/src/backend/commands/opclasscmds.c b/src/backend/commands/opclasscmds.c
index b8b5c147c..a86dbf71b 100644
--- a/src/backend/commands/opclasscmds.c
+++ b/src/backend/commands/opclasscmds.c
@@ -1330,6 +1330,31 @@ assignProcTypes(OpFamilyMember *member, Oid amoid, Oid typeoid,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
errmsg("btree equal image functions must not be cross-type")));
}
+ else if (member->number == BTSKIPSUPPORT_PROC)
+ {
+ if (procform->pronargs != 1 ||
+ procform->proargtypes.values[0] != INTERNALOID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must accept type \"internal\"")));
+ if (procform->prorettype != VOIDOID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must return void")));
+
+ /*
+ * pg_amproc functions are indexed by (lefttype, righttype), but a
+ * skip support function doesn't make sense in cross-type
+ * scenarios. The same opclass opcintype OID is always used for
+ * lefttype and righttype. Providing a cross-type routine isn't
+ * sensible. Reject cross-type ALTER OPERATOR FAMILY ... ADD
+ * FUNCTION 6 statements here.
+ */
+ if (member->lefttype != member->righttype)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must not be cross-type")));
+ }
}
else if (amoid == HASH_AM_OID)
{
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index edb09d4e3..e945686c8 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -96,6 +96,7 @@ OBJS = \
rowtypes.o \
ruleutils.o \
selfuncs.o \
+ skipsupport.o \
tid.o \
timestamp.o \
trigfuncs.o \
diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c
index 9c854e0e5..ea3d0f4b5 100644
--- a/src/backend/utils/adt/date.c
+++ b/src/backend/utils/adt/date.c
@@ -34,6 +34,7 @@
#include "utils/date.h"
#include "utils/datetime.h"
#include "utils/numeric.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
/*
@@ -455,6 +456,39 @@ date_sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+date_decrement(Relation rel, Datum existing)
+{
+ DateADT dexisting = DatumGetDateADT(existing);
+
+ Assert(dexisting > DATEVAL_NOBEGIN);
+
+ return DateADTGetDatum(dexisting - 1);
+}
+
+static Datum
+date_increment(Relation rel, Datum existing)
+{
+ DateADT dexisting = DatumGetDateADT(existing);
+
+ Assert(dexisting < DATEVAL_NOEND);
+
+ return DateADTGetDatum(dexisting + 1);
+}
+
+Datum
+date_skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = date_decrement;
+ sksup->increment = date_increment;
+ sksup->low_elem = DateADTGetDatum(DATEVAL_NOBEGIN);
+ sksup->high_elem = DateADTGetDatum(DATEVAL_NOEND);
+
+ PG_RETURN_VOID();
+}
+
Datum
date_finite(PG_FUNCTION_ARGS)
{
diff --git a/src/backend/utils/adt/meson.build b/src/backend/utils/adt/meson.build
index 8c6fc80c3..91682edd5 100644
--- a/src/backend/utils/adt/meson.build
+++ b/src/backend/utils/adt/meson.build
@@ -83,6 +83,7 @@ backend_sources += files(
'rowtypes.c',
'ruleutils.c',
'selfuncs.c',
+ 'skipsupport.c',
'tid.c',
'timestamp.c',
'trigfuncs.c',
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5f5d7959d..c1df7be9f 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6800,6 +6800,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
List *indexBoundQuals;
int indexcol;
bool eqQualHere;
+ bool found_skip;
bool found_saop;
bool found_is_null_op;
double num_sa_scans;
@@ -6825,6 +6826,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
indexBoundQuals = NIL;
indexcol = 0;
eqQualHere = false;
+ found_skip = false;
found_saop = false;
found_is_null_op = false;
num_sa_scans = 1;
@@ -6833,15 +6835,38 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
IndexClause *iclause = lfirst_node(IndexClause, lc);
ListCell *lc2;
+ /*
+ * XXX For now we just cost skip scans via generic rules: make a
+ * uniform assumption that there will be 10 primitive index scans per
+ * skipped attribute, relying on the "1/3 of all index pages" cap that
+ * this costing has used since Postgres 17. Also assume that skipping
+ * won't take place for an index that has fewer than 100 pages.
+ *
+ * The current approach to costing leaves much to be desired, but is
+ * at least better than nothing at all (keeping the code as it is on
+ * HEAD just makes testing and review inconvenient).
+ */
if (indexcol != iclause->indexcol)
{
/* Beginning of a new column's quals */
if (!eqQualHere)
- break; /* done if no '=' qual for indexcol */
+ {
+ found_skip = true; /* skip when no '=' qual for indexcol */
+ if (index->pages < 100)
+ break;
+ num_sa_scans += 10;
+ }
eqQualHere = false;
indexcol++;
if (indexcol != iclause->indexcol)
- break; /* no quals at all for indexcol */
+ {
+ /* no quals at all for indexcol */
+ found_skip = true;
+ if (index->pages < 100)
+ break;
+ num_sa_scans += 10 * (indexcol - iclause->indexcol);
+ continue;
+ }
}
/* Examine each indexqual associated with this index clause */
@@ -6914,6 +6939,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
if (index->unique &&
indexcol == index->nkeycolumns - 1 &&
eqQualHere &&
+ !found_skip &&
!found_saop &&
!found_is_null_op)
numIndexTuples = 1.0;
diff --git a/src/backend/utils/adt/skipsupport.c b/src/backend/utils/adt/skipsupport.c
new file mode 100644
index 000000000..9665e4985
--- /dev/null
+++ b/src/backend/utils/adt/skipsupport.c
@@ -0,0 +1,54 @@
+/*-------------------------------------------------------------------------
+ *
+ * skipsupport.c
+ * Support routines for B-Tree skip scans.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/adt/skipsupport.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <limits.h>
+
+#include "access/nbtree.h"
+#include "utils/lsyscache.h"
+#include "utils/skipsupport.h"
+
+/*
+ * Fill in SkipSupport given an operator class (opfamily + opcintype).
+ *
+ * On success, returns true, and initializes all SkipSupport fields for
+ * caller. Otherwise returns false, indicating that operator class has no
+ * skip support function.
+ */
+bool
+PrepareSkipSupportFromOpclass(Oid opfamily, Oid opcintype, bool reverse,
+ SkipSupport sksup)
+{
+ Oid skipSupportFunction;
+
+ /* Look for a skip support function */
+ skipSupportFunction = get_opfamily_proc(opfamily, opcintype, opcintype,
+ BTSKIPSUPPORT_PROC);
+ if (!OidIsValid(skipSupportFunction))
+ return false;
+
+ OidFunctionCall1(skipSupportFunction, PointerGetDatum(sksup));
+
+ if (reverse)
+ {
+ Datum low_elem = sksup->low_elem;
+
+ sksup->low_elem = sksup->high_elem;
+ sksup->high_elem = low_elem;
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/adt/uuid.c b/src/backend/utils/adt/uuid.c
index 45eb1b2fe..a9222f896 100644
--- a/src/backend/utils/adt/uuid.c
+++ b/src/backend/utils/adt/uuid.c
@@ -13,12 +13,15 @@
#include "postgres.h"
+#include <limits.h>
+
#include "common/hashfn.h"
#include "lib/hyperloglog.h"
#include "libpq/pqformat.h"
#include "port/pg_bswap.h"
#include "utils/fmgrprotos.h"
#include "utils/guc.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
#include "utils/timestamp.h"
#include "utils/uuid.h"
@@ -390,6 +393,68 @@ uuid_abbrev_convert(Datum original, SortSupport ssup)
return res;
}
+static Datum
+uuid_decrement(Relation rel, Datum existing)
+{
+ pg_uuid_t *uuid;
+
+ uuid = (pg_uuid_t *) palloc(UUID_LEN);
+ memcpy(uuid, DatumGetUUIDP(existing), UUID_LEN);
+ for (int i = UUID_LEN - 1; i >= 0; i--)
+ {
+ if (uuid->data[i] > 0)
+ {
+ uuid->data[i]--;
+ return UUIDPGetDatum(uuid);
+ }
+ uuid->data[i] = UCHAR_MAX;
+ }
+
+ Assert(false);
+
+ return UUIDPGetDatum(uuid);
+}
+
+static Datum
+uuid_increment(Relation rel, Datum existing)
+{
+ pg_uuid_t *uuid;
+
+ uuid = (pg_uuid_t *) palloc(UUID_LEN);
+ memcpy(uuid, DatumGetUUIDP(existing), UUID_LEN);
+ for (int i = UUID_LEN - 1; i >= 0; i--)
+ {
+ if (uuid->data[i] < UCHAR_MAX)
+ {
+ uuid->data[i]++;
+ return UUIDPGetDatum(uuid);
+ }
+ uuid->data[i] = 0;
+ }
+
+ Assert(false);
+
+ return UUIDPGetDatum(uuid);
+}
+
+Datum
+uuid_skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+ pg_uuid_t *uuid_min = palloc(UUID_LEN);
+ pg_uuid_t *uuid_max = palloc(UUID_LEN);
+
+ memset(uuid_min->data, 0x00, UUID_LEN);
+ memset(uuid_max->data, 0xFF, UUID_LEN);
+
+ sksup->decrement = uuid_decrement;
+ sksup->increment = uuid_increment;
+ sksup->low_elem = UUIDPGetDatum(uuid_min);
+ sksup->high_elem = UUIDPGetDatum(uuid_max);
+
+ PG_RETURN_VOID();
+}
+
/* hash index support */
Datum
uuid_hash(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 6f4188599..8ec3f150a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -28,6 +28,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/nbtree.h"
#include "access/slru.h"
#include "access/toast_compression.h"
#include "access/twophase.h"
@@ -3523,6 +3524,17 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ /* XXX Remove before commit */
+ {
+ {"skipscan_prefix_cols", PGC_SUSET, DEVELOPER_OPTIONS,
+ NULL, NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &skipscan_prefix_cols,
+ INDEX_MAX_KEYS, 0, INDEX_MAX_KEYS,
+ NULL, NULL, NULL
+ },
+
{
/* Can't be set in postgresql.conf */
{"server_version_num", PGC_INTERNAL, PRESET_OPTIONS,
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 2b3997988..9662fb2ba 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -583,6 +583,19 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
</para>
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><function>skipsupport</function></term>
+ <listitem>
+ <para>
+ Optionally, a btree operator family may provide a <firstterm>skip
+ support</firstterm> function, registered under support function
+ number 6. These functions allow the B-tree code to more efficiently
+ navigate the index structure via an index <quote>skip scan</quote>. The
+ APIs involved in this are defined in
+ <filename>src/include/utils/skipsupport.h</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</sect2>
diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index 22d8ad1aa..f17dd3456 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -461,6 +461,13 @@
</entry>
<entry>5</entry>
</row>
+ <row>
+ <entry>
+ Return the addresses of C-callable skip support function(s)
+ (optional)
+ </entry>
+ <entry>6</entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -1056,7 +1063,8 @@ DEFAULT FOR TYPE int8 USING btree FAMILY integer_ops AS
FUNCTION 1 btint8cmp(int8, int8) ,
FUNCTION 2 btint8sortsupport(internal) ,
FUNCTION 3 in_range(int8, int8, int8, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint8skipsupport(internal);
CREATE OPERATOR CLASS int4_ops
DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
@@ -1069,7 +1077,8 @@ DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
FUNCTION 1 btint4cmp(int4, int4) ,
FUNCTION 2 btint4sortsupport(internal) ,
FUNCTION 3 in_range(int4, int4, int4, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint4skipsupport(internal);
CREATE OPERATOR CLASS int2_ops
DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
@@ -1082,7 +1091,8 @@ DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
FUNCTION 1 btint2cmp(int2, int2) ,
FUNCTION 2 btint2sortsupport(internal) ,
FUNCTION 3 in_range(int2, int2, int2, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint2skipsupport(internal);
ALTER OPERATOR FAMILY integer_ops USING btree ADD
-- cross-type comparisons int8 vs int2
diff --git a/src/test/regress/expected/alter_generic.out b/src/test/regress/expected/alter_generic.out
index ae54cb254..8b6b775c1 100644
--- a/src/test/regress/expected/alter_generic.out
+++ b/src/test/regress/expected/alter_generic.out
@@ -362,9 +362,9 @@ ERROR: invalid operator number 0, must be between 1 and 5
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 1 < ; -- operator without argument types
ERROR: operator argument types must be specified in ALTER OPERATOR FAMILY
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 0 btint42cmp(int4, int2); -- invalid options parsing function
-ERROR: invalid function number 0, must be between 1 and 5
-ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 6 btint42cmp(int4, int2); -- function number should be between 1 and 5
-ERROR: invalid function number 6, must be between 1 and 5
+ERROR: invalid function number 0, must be between 1 and 6
+ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 7 btint42cmp(int4, int2); -- function number should be between 1 and 6
+ERROR: invalid function number 7, must be between 1 and 6
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD STORAGE invalid_storage; -- Ensure STORAGE is not a part of ALTER OPERATOR FAMILY
ERROR: STORAGE cannot be specified in ALTER OPERATOR FAMILY
DROP OPERATOR FAMILY alt_opf4 USING btree;
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 3bbe4c5f9..a8d5be6c1 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5138,9 +5138,10 @@ List of access methods
btree | uuid_ops | uuid | uuid | 1 | uuid_cmp
btree | uuid_ops | uuid | uuid | 2 | uuid_sortsupport
btree | uuid_ops | uuid | uuid | 4 | btequalimage
+ btree | uuid_ops | uuid | uuid | 6 | uuid_skipsupport
hash | uuid_ops | uuid | uuid | 1 | uuid_hash
hash | uuid_ops | uuid | uuid | 2 | uuid_hash_extended
-(5 rows)
+(6 rows)
-- check \dconfig
set work_mem = 10240;
diff --git a/src/test/regress/sql/alter_generic.sql b/src/test/regress/sql/alter_generic.sql
index de58d268d..4246afefd 100644
--- a/src/test/regress/sql/alter_generic.sql
+++ b/src/test/regress/sql/alter_generic.sql
@@ -310,7 +310,7 @@ ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 6 < (int4, int2); -- ope
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 0 < (int4, int2); -- operator number should be between 1 and 5
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 1 < ; -- operator without argument types
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 0 btint42cmp(int4, int2); -- invalid options parsing function
-ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 6 btint42cmp(int4, int2); -- function number should be between 1 and 5
+ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 7 btint42cmp(int4, int2); -- function number should be between 1 and 6
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD STORAGE invalid_storage; -- Ensure STORAGE is not a part of ALTER OPERATOR FAMILY
DROP OPERATOR FAMILY alt_opf4 USING btree;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e6c1caf64..369eca3a4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -218,6 +218,7 @@ BTScanPos
BTScanPosData
BTScanPosItem
BTShared
+BTSkipPreproc
BTSortArrayContext
BTSpool
BTStack
@@ -2650,6 +2651,8 @@ SingleBoundSortItem
SinglePartitionSpec
Size
SkipPages
+SkipSupport
+SkipSupportData
SlabBlock
SlabContext
SlabSlot
--
2.45.2
On Tue, Jul 2, 2024 at 12:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached v2 fixes this bug. The problem was that the skip support
function used by the "char" opclass assumed signed char comparisons,
even though the authoritative B-Tree comparator (support function 1)
uses unsigned comparisons (via uint8 casting). A simple oversight.
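To illustrate the kind of mismatch described above, here is a minimal standalone sketch (not code from the patch). It assumes that the comparator orders "char" values as unsigned bytes via a uint8 cast; the helper names char_cmp_unsigned, char_cmp_signed and char_increment_unsigned are hypothetical. Stepping past 0x7F gives a value that sorts higher under the unsigned ordering but lower under a signed one, which is why an increment routine and a comparator that disagree on signedness cannot work together.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Order two "char" values as unsigned bytes (the uint8-cast ordering). */
int
char_cmp_unsigned(char a, char b)
{
    return (int) (uint8_t) a - (int) (uint8_t) b;
}

/* Order the same two values as signed bytes (the mistaken assumption). */
int
char_cmp_signed(char a, char b)
{
    return (int) (signed char) a - (int) (signed char) b;
}

/*
 * Hypothetical increment helper: steps to the next value in unsigned order.
 * Returns false when 'existing' is already the largest value (0xFF).
 */
bool
char_increment_unsigned(char existing, char *next)
{
    uint8_t     u = (uint8_t) existing;

    assert(next != NULL);
    if (u == UINT8_MAX)
        return false;           /* no successor exists */
    *next = (char) (u + 1);
    return true;
}
```

Calling char_increment_unsigned((char) 0x7F, &next) yields the byte 0x80, which char_cmp_unsigned places after 0x7F and char_cmp_signed places before it.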
Although v2 gives correct answers to the queries, the scan itself
performs an excessive amount of leaf page accesses. In short, it
behaves just like a full index scan would, even though we should
expect it to skip over significant runs of the index. So that's
another bug.
It looks like the queries you posted have a kind of adversarial
quality to them, as if they were designed to confuse the
implementation. Was it intentional? Did you take them from an existing
test suite somewhere?
The custom instrumentation I use to debug these issues shows:
_bt_readpage: 🍀 1981 with 175 offsets/tuples (leftsib 4032, rightsib 3991) ➡️
_bt_readpage first: (c, n)=(b, 998982285), TID='(1236,173)',
0x7f1464fe9fc0, from non-pivot offnum 2 started page
_bt_readpage final: , (nil), continuescan high key check did not set
so->currPos.moreRight=false ➡️ 🟢
_bt_readpage stats: currPos.firstItem: 0, currPos.lastItem: 173,
nmatching: 174 ✅
_bt_readpage: 🍀 3991 with 175 offsets/tuples (leftsib 1981, rightsib 9) ➡️
_bt_readpage first: (c, n)=(b, 999474517), TID='(4210,9)',
0x7f1464febfc8, from non-pivot offnum 2 started page
_bt_readpage final: , (nil), continuescan high key check did not set
so->currPos.moreRight=false ➡️ 🟢
_bt_readpage stats: currPos.firstItem: 0, currPos.lastItem: 173,
nmatching: 174 ✅
_bt_readpage: 🍀 9 with 229 offsets/tuples (leftsib 3991, rightsib 3104) ➡️
_bt_readpage first: (c, n)=(c, 1606), TID='(882,68)', 0x7f1464fedfc0,
from non-pivot offnum 2 started page
_bt_readpage final: , (nil), continuescan high key check did not set
so->currPos.moreRight=false ➡️ 🟢
_bt_readpage stats: currPos.firstItem: 0, currPos.lastItem: -1, nmatching: 0 ❌
_bt_readpage: 🍀 3104 with 258 offsets/tuples (leftsib 9, rightsib 1685) ➡️
_bt_readpage first: (c, n)=(c, 706836), TID='(3213,4)',
0x7f1464feffc0, from non-pivot offnum 2 started page
_bt_readpage final: , (nil), continuescan high key check did not set
so->currPos.moreRight=false ➡️ 🟢
_bt_readpage stats: currPos.firstItem: 0, currPos.lastItem: -1, nmatching: 0 ❌
*** SNIP, many more "nmatching: 0" pages appear after these two ***
The final _bt_advance_array_keys call for leaf page 3991 should be
scheduling a new primitive index scan (i.e. skipping), but that never
happens. Not entirely sure why that is, but it probably has something
to do with _bt_advance_array_keys failing to hit the
"has_required_opposite_direction_only" path for determining if another
primitive scan is required. You're using an inequality required in the
opposite-to-scan-direction here, so that path is likely to be
relevant.
--
Peter Geoghegan
On Tue, Jul 2, 2024 at 12:55 PM Peter Geoghegan <pg@bowt.ie> wrote:
Although v2 gives correct answers to the queries, the scan itself
performs an excessive amount of leaf page accesses. In short, it
behaves just like a full index scan would, even though we should
expect it to skip over significant runs of the index. So that's
another bug.
Hit "send" too soon. I simply forgot to run "alter table test1 alter
column c type "char";" before running the query. So, I was mistaken
about there still being a bug in v2. The issue here is that we don't
have support for the underlying type, char(1) -- nothing more.
v2 of the patch with your query 1 (when changed to use the "char"
type/opclass instead of the currently unsupported char(1)
type/opclass) performs 395 index related buffer hits, and 5406 heap
block accesses. Whereas it's 3833 index buffer hits with master
(naturally, the same 5406 heap accesses are required with master). In
short, this query isn't particularly sympathetic to the patch. Nor is
it unsympathetic.
--
Peter Geoghegan
Hi Peter,
It looks like the queries you posted have a kind of adversarial
quality to them, as if they were designed to confuse the
implementation. Was it intentional?
To some extent. I merely wrote several queries that I would expect
to benefit from skip scans. Since I hadn't looked at the queries you
used, there was a chance that I would hit something interesting.
Attached v2 fixes this bug. The problem was that the skip support
function used by the "char" opclass assumed signed char comparisons,
even though the authoritative B-Tree comparator (support function 1)
uses unsigned comparisons (via uint8 casting). A simple oversight. Your
test cases will work with this v2, provided you use "char" (instead of
unadorned char) in the create table statements.
Thanks for v2.
If you change your table definition to CREATE TABLE test1(c "char", n
bigint), then your example queries can use the optimization. This
makes a huge difference.
You are right, it does.
Test1 takes 33.7 ms now (53 ms before the patch, x1.57)
Test3 I showed before contained an error in the table definition
(Postgres can't do `n bigint, s text DEFAULT 'text_value' || n`). Here
is the corrected test:
```
CREATE TABLE test3(c "char", n bigint, s text);
CREATE INDEX test3_idx ON test3 USING btree(c,n) INCLUDE(s);
INSERT INTO test3
SELECT chr(ascii('a') + random(0,2)) AS c,
random(0, 1_000_000_000) AS n,
'text_value_' || random(0, 1_000_000_000) AS s
FROM generate_series(0, 1_000_000);
EXPLAIN ANALYZE SELECT s FROM test3 WHERE n < 10_000;
```
It runs fast (< 1 ms) and uses the index, as expected.
Test2 with "char" doesn't seem to benefit from the patch anymore
(pretty sure it did in v1). It always chooses Parallel Seq Scans even
if I change the condition to `WHERE n > 999_995_000` or `WHERE n =
999_997_362`. Is it an expected behavior?
I also tried Test4 and Test5.
In Test4 I was curious whether skip scans work properly with functional indexes:
```
CREATE TABLE test4(d date, n bigint);
CREATE INDEX test4_idx ON test4 USING btree(extract(year from d),n);
INSERT INTO test4
SELECT ('2024-' || random(1,12) || '-' || random(1,28)) :: date AS d,
random(0, 1_000_000_000) AS n
FROM generate_series(0, 1_000_000);
EXPLAIN ANALYZE SELECT COUNT(*) FROM test4 WHERE n > 900_000_000;
```
The query uses Index Scan, however the performance is worse than with
Seq Scan chosen before the patch. It doesn't matter if I choose '>' or
'=' condition.
Test5 checks how skip scans work with partial indexes:
```
CREATE TABLE test5(c "char", n bigint);
CREATE INDEX test5_idx ON test5 USING btree(c, n) WHERE n > 900_000_000;
INSERT INTO test5
SELECT chr(ascii('a') + random(0,2)) AS c,
random(0, 1_000_000_000) AS n
FROM generate_series(0, 1_000_000);
EXPLAIN ANALYZE SELECT COUNT(*) FROM test5 WHERE n > 950_000_000;
```
It runs fast and chooses Index Only Scan. But then I discovered that
without the patch Postgres also uses Index Only Scan for this query. I
didn't know it could do this - what is the name of this technique? The
query takes 17.6 ms with the patch, 21 ms without the patch. Not a
huge win but still.
That's all I have for now.
--
Best regards,
Aleksander Alekseev
On Fri, Jul 5, 2024 at 7:04 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:
Test2 with "char" doesn't seem to benefit from the patch anymore
(pretty sure it did in v1). It always chooses Parallel Seq Scans even
if I change the condition to `WHERE n > 999_995_000` or `WHERE n =
999_997_362`. Is it an expected behavior?
The "char" opclass's skip support routine was totally broken in v1, so
its performance isn't really relevant. In any case v2 didn't make any
changes to the costing, so I'd expect it to use exactly the same query
plan as v1.
The query uses Index Scan, however the performance is worse than with
Seq Scan chosen before the patch. It doesn't matter if I choose '>' or
'=' condition.
That's because the index has a leading/skipped column of type
"numeric", which isn't a supported type just yet (a supported B-Tree
opclass, actually).
The optimization is effective if you create the expression index with
a cast to integer:
CREATE INDEX test4_idx ON test4 USING btree(((extract(year from d))::int4),n);
This performs much better. Now I see "DEBUG: skipping 1 index
attributes" when I run the query "EXPLAIN (ANALYZE, BUFFERS) SELECT
COUNT(*) FROM test4 WHERE n > 900_000_000", which indicates that the
optimization has in fact been used as expected. There are far fewer
buffers hit with this version of your test4, which also indicates that
the optimization has been effective.
Note that the original numeric expression index test4 showed "DEBUG:
skipping 0 index attributes" when the test query ran, which indicated
that the optimization couldn't be used. I suggest that you look out
for that, by running "set client_min_messages to debug2;" from psql
when testing the patch.
It runs fast and chooses Index Only Scan. But then I discovered that
without the patch Postgres also uses Index Only Scan for this query. I
didn't know it could do this - what is the name of this technique?
It is a full index scan. These have been possible for many years now
(possibly well over 20 years).
Arguably, the numeric case that didn't use the optimization (your
test4) should have been costed as a full index scan, but it wasn't --
that's why you didn't get a faster sequential scan, which would have
made a little bit more sense. In general, the costing changes in the
patch are very rough.
That said, this particular problem (the test4 numeric issue) should be
fixed by inventing a way for nbtree to use skip scan with types that
lack skip support. It's not primarily a problem with the costing. At
least not in my mind.
--
Peter Geoghegan
On Fri, Jul 5, 2024 at 8:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
CREATE INDEX test4_idx ON test4 USING btree(((extract(year from d))::int4),n);
This performs much better. Now I see "DEBUG: skipping 1 index
attributes" when I run the query "EXPLAIN (ANALYZE, BUFFERS) SELECT
COUNT(*) FROM test4 WHERE n > 900_000_000", which indicates that the
optimization has in fact been used as expected. There are far fewer
buffers hit with this version of your test4, which also indicates that
the optimization has been effective.
Actually, with an index-only scan it is 281 buffer hits (including
some small number of VM buffer hits) with the patch, versus 2736
buffer hits on master. So a big change to the number of index page
accesses only.
If you use a plain index scan for this, then the cost of random heap
accesses totally dominates, so skip scan cannot possibly give much
benefit. Even a similar bitmap scan requires 4425 distinct heap page accesses,
which is significantly more than the total number of index pages in
the index. 4425 heap pages is almost the entire table; the table
consists of 4480 mainfork blocks.
This is a very nonselective query. It's not at all surprising that
this query (and others like it) hardly benefits at all, except when we
can use an index-only scan (so that the cost of heap accesses doesn't
totally dominate).
--
Peter Geoghegan
Hi,
Since I'd like to understand the skip scan to improve the EXPLAIN output
for multicolumn B-Tree Index[1], I began to try the skip scan with some
queries and look into the source code.
I have some feedback and comments.
(1)
First, I was surprised by your benchmark results, because skip scan
can improve performance so much. I agree that many users will be
happy with this feature, especially for OLAP use cases. I hope to be able to use it in v18.
(2)
I found that the cost is estimated to be much higher if the number of skipped attributes
is more than two. Is this expected behavior?
# Test results. The attached file has the details of the tests.
-- Index Scan
-- The actual time is low since the skip scan works well
-- But the cost is higher than that of the seq scan
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_id1_id2_id3 on public.test (cost=0.42..26562.77 rows=984 width=20) (actual time=0.051..15.533 rows=991 loops=1)
Output: id1, id2, id3, value
Index Cond: (test.id3 = 101)
Buffers: shared hit=4402
Planning:
Buffers: shared hit=7
Planning Time: 0.234 ms
Execution Time: 15.711 ms
(8 rows)
-- Seq Scan
-- The actual time is high, but the cost is lower than that of the Index Scan above.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..12676.73 rows=984 width=20) (actual time=0.856..113.861 rows=991 loops=1)
Output: id1, id2, id3, value
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=6370
-> Parallel Seq Scan on public.test (cost=0.00..11578.33 rows=410 width=20) (actual time=0.061..102.016 rows=330 loops=3)
Output: id1, id2, id3, value
Filter: (test.id3 = 101)
Rows Removed by Filter: 333003
Buffers: shared hit=6370
Worker 0: actual time=0.099..98.014 rows=315 loops=1
Buffers: shared hit=2066
Worker 1: actual time=0.054..97.162 rows=299 loops=1
Buffers: shared hit=1858
Planning:
Buffers: shared hit=19
Planning Time: 0.194 ms
Execution Time: 114.129 ms
(18 rows)
I looked at btcostestimate() to find the reason, and found that the bound quals
and cost.num_sa_scans are different from what I expected.
My assumption was:
* the bound quals are id3=XXX (and id1 and id2 are skipped attributes)
* cost.num_sa_scans = 100 (=10*10, assuming 10 primitive index scans
per skipped attribute)
But that's wrong. What the above index scan actually gets is:
* the bound quals are NULL
* cost.num_sa_scans = 1
I know you said the following, but I'd like to know whether the above is expected or not.
That approach seems far more practicable than preempting the problem
during planning or during nbtree preprocessing. It seems like it'd be
very hard to model the costs statistically. We need revisions to
btcostestimate, of course, but the less we can rely on btcostestimate
the better. As I said, there are no new index paths generated by the
optimizer for any of this.
I couldn't quite understand why the logic below is there.
btcostestimate()
(...omit...)
if (indexcol != iclause->indexcol)
{
/* no quals at all for indexcol */
found_skip = true;
if (index->pages < 100)
break;
num_sa_scans += 10 * (indexcol - iclause->indexcol); // why add minus value?
continue; // why skip to add bound quals?
}
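To make the two questions in the comments above concrete, here is a minimal standalone model of the quoted loop. It is a sketch under stated assumptions, not code from the patch: the function model_num_sa_scans and its parameters are hypothetical, each index clause is reduced to just its index column number, and every clause is assumed to be an equality qual. Only the arithmetic on num_sa_scans mirrors the hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified model of the patched loop in btcostestimate().  clause_cols[]
 * holds the index column of each index clause, in ascending order.  All
 * names are hypothetical; only the num_sa_scans arithmetic follows the hunk.
 */
double
model_num_sa_scans(const int *clause_cols, int nclauses, double index_pages,
                   int *nbound_quals)
{
    double      num_sa_scans = 1;
    int         indexcol = 0;
    bool        eqQualHere = false;

    assert(nclauses >= 0 && nbound_quals != NULL);
    *nbound_quals = 0;

    for (int i = 0; i < nclauses; i++)
    {
        int         clausecol = clause_cols[i];

        if (indexcol != clausecol)
        {
            if (!eqQualHere)
            {
                if (index_pages < 100)
                    break;
                num_sa_scans += 10;
            }
            eqQualHere = false;
            indexcol++;
            if (indexcol != clausecol)
            {
                if (index_pages < 100)
                    break;
                /* indexcol < clausecol here, so this term is negative */
                num_sa_scans += 10 * (indexcol - clausecol);
                continue;       /* clause never reaches the bound quals */
            }
        }
        eqQualHere = true;      /* assumption: every clause is an '=' qual */
        (*nbound_quals)++;
    }
    return num_sa_scans;
}
```

With a single clause on the third column (clause_cols = {2}) and an index of at least 100 pages, the model adds 10, then adds 10 * (1 - 2) = -10, and skips the clause. It ends with num_sa_scans = 1 and no bound quals, matching the values reported above. With a single clause on the second column (clause_cols = {1}) it returns 11 with one bound qual.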
(3)
Currently, there is an assumption that "there will be 10 primitive index scans
per skipped attribute". Is there any chance of using pg_stats.n_distinct?
[1]: Improve EXPLAIN output for multicolumn B-Tree Index /messages/by-id/TYWPR01MB1098260B694D27758FE2BA46FB1C92@TYWPR01MB10982.jpnprd01.prod.outlook.com
Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
On Fri, Jul 12, 2024 at 1:19 AM <Masahiro.Ikeda@nttdata.com> wrote:
Since I'd like to understand the skip scan to improve the EXPLAIN output
for multicolumn B-Tree Index[1], I began to try the skip scan with some
queries and look into the source code.
Thanks for the review!
Attached is v3, which generalizes skip scan, allowing it to work with
opclasses/types that lack a skip support routine. In other words, v3
makes skip scan work for all types, including continuous types, where
it's impractical or infeasible to add skip support. So now important
types like text and numeric also get the skip scan optimization (it's
not just discrete types like integer and date, as in previous
versions).
I feel very strongly that everything should be implemented as part of
the new skip array abstraction; the patch should only add the concept
of skip arrays, which should work just like SAOP arrays. We should
avoid introducing any special cases. In short, _bt_advance_array_keys
should work in exactly the same way as it does as of Postgres 17
(except for a few representational differences for skip arrays). This
seems essential because _bt_advance_array_keys inherently needs to be
able to trigger moving on to the next skip array value when it reaches
the end of a SAOP array (and vice-versa). And so it just makes sense
to abstract-away the differences, hiding the difference in lower level
code.
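As an illustration of the uniform treatment described above, here is a
toy model in C. It is not code from the patch: ToyArray, toy_increment
and toy_advance are hypothetical names, and a small integer range stands
in for a real opclass. A skip array generates its next element by
incrementing, a conventional array steps through stored elements, and a
roll over in a lower-order array carries into the next higher-order
array, whichever kind that happens to be.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model: one array per index column */
typedef struct ToyArray
{
    bool        is_skip;    /* true: elements are generated procedurally */
    int         cur;        /* skip array: current value;
                             * conventional array: index into elems */
    const int  *elems;      /* conventional array only */
    int         nelems;     /* conventional array only */
    int         low;        /* skip array only: lowest value */
    int         high;       /* skip array only: highest value */
} ToyArray;

/* Value that the array currently contributes to the qual */
static int
toy_current(const ToyArray *a)
{
    return a->is_skip ? a->cur : a->elems[a->cur];
}

/*
 * Advance one array to its next element.  Returns false when the array
 * rolled over, in which case it has been reset to its first element.
 */
static bool
toy_increment(ToyArray *a)
{
    if (a->is_skip)
    {
        if (a->cur < a->high)
        {
            a->cur++;
            return true;
        }
        a->cur = a->low;
        return false;
    }

    if (a->cur + 1 < a->nelems)
    {
        a->cur++;
        return true;
    }
    a->cur = 0;
    return false;
}

/*
 * Advance the set of arrays like an odometer: try the least significant
 * array first, and carry into higher-order arrays on roll over.  Returns
 * false once every array has rolled over (no more combinations).
 */
static bool
toy_advance(ToyArray *arrays, int narrays)
{
    assert(narrays > 0);

    for (int i = narrays - 1; i >= 0; i--)
    {
        if (toy_increment(&arrays[i]))
            return true;
    }
    return false;
}
```

With a skip array over the range 1..2 on the leading column and a
conventional array holding {5, 7} on the second column, repeated calls to
toy_advance step through (1, 5), (1, 7), (2, 5), (2, 7) and then report
exhaustion. The caller never needs to know which kind of array carried.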
I have described the new _bt_first behavior that is now available in
this new v3 of the patch as "adding explicit next key probes". While
v3 does make new changes to _bt_first, it's not really a special kind
of index probe. v3 invents new sentinel values instead.
The use of sentinels avoids inventing true special cases: the values
-inf, +inf, as well as variants of = that use a real datum value, but
match on the next key in the index. These new = variants can be
thought of as "+infinitesimal" values. So when _bt_advance_array_keys
has to "increment" the numeric value 5.0, it sets the scan key to the
value "5.0 +infinitesimal". There can never be any matching tuples in
the index (just like with -inf sentinel values), but that doesn't
matter. So the changes v3 makes to _bt_first don't change the basic
conceptual model. The added complexity is kept to a manageable level,
particularly within _bt_advance_array_keys, which is already very
complicated.
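The sentinel idea can be sketched with a toy comparator in C. This is an
illustration under simplifying assumptions, not the patch's
representation (the patch marks scan keys with flags such as
SK_BT_NEG_INF, SK_BT_POS_INF and SK_BT_NEXTKEY). ToySkipKey, toy_compare
and toy_first_ge are hypothetical names, and a sorted array of doubles
stands in for the leaf level of the index.

```c
#include <assert.h>
#include <stddef.h>

typedef enum ToyKeyKind
{
    TOY_EXACT,      /* ordinary = comparison against arg */
    TOY_NEG_INF,    /* sorts before every indexed value */
    TOY_POS_INF,    /* sorts after every indexed value */
    TOY_NEXTKEY     /* "arg +infinitesimal": just after arg */
} ToyKeyKind;

typedef struct ToySkipKey
{
    ToyKeyKind  kind;
    double      arg;    /* ignored for the -inf/+inf sentinels */
} ToySkipKey;

/* "Increment" without skip support: keep the datum, mark it next-key */
static ToySkipKey
toy_increment_nosupport(ToySkipKey key)
{
    key.kind = TOY_NEXTKEY;
    return key;
}

/*
 * Three-way comparison of the key against an indexed value: negative when
 * the key sorts before the value, positive when it sorts after, and 0 on
 * equality.  Sentinel keys never return 0, so no tuple can match them.
 */
static int
toy_compare(ToySkipKey key, double tupval)
{
    switch (key.kind)
    {
        case TOY_NEG_INF:
            return -1;
        case TOY_POS_INF:
            return 1;
        case TOY_NEXTKEY:
            return (key.arg < tupval) ? -1 : 1;
        default:
            return (key.arg > tupval) - (key.arg < tupval);
    }
}

/* Binary search for the first position whose value is >= the key */
static size_t
toy_first_ge(ToySkipKey key, const double *vals, size_t n)
{
    size_t      lo = 0;
    size_t      hi = n;

    assert(vals != NULL || n == 0);

    while (lo < hi)
    {
        size_t      mid = lo + (hi - lo) / 2;

        if (toy_compare(key, vals[mid]) > 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

Searching the values {1.5, 5.0, 5.0, 7.25} for an exact key of 5.0 lands
on the first 5.0. After toy_increment_nosupport, the same search lands on
7.25, which is the next distinct value. In this toy model the scan would
then take 7.25 from that position as the skip array's new current
element.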
To help with testing and review, I've added another temporary testing
GUC to v3: skipscan_skipsupport_enabled. This can be set to "false" to
avoid using skip support, even where available. The GUC makes it easy
to measure how skip support routines can help performance (with
discrete types like integer and date).
I found that the cost is estimated to be much higher if the number of skipped
attributes is more than two. Is that the expected behavior?
Yes and no.
Honestly, the current costing is just placeholder code. It is totally
inadequate. I'm not surprised that you found problems with it. I just
didn't put much work into it, because I didn't really know what to do.
# Test results. The attached file has the details of the tests.
-- Index Scan
-- The actual time is low since the skip scan works well
-- But the cost is higher than that of the seqscan
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_id1_id2_id3 on public.test (cost=0.42..26562.77 rows=984 width=20) (actual time=0.051..15.533 rows=991 loops=1)
Output: id1, id2, id3, value
Index Cond: (test.id3 = 101)
Buffers: shared hit=4402
Planning:
Buffers: shared hit=7
Planning Time: 0.234 ms
Execution Time: 15.711 ms
(8 rows)
This is a useful example, because it shows the difficulty with the
costing. I ran this query using my own custom instrumentation of the
scan. I saw that we only ever manage to skip ahead by perhaps 3 leaf
pages at a time, but we still come out ahead. As you pointed out, it's
~7.5x faster than the sequential scan, but not very different to the
equivalent full index scan. At least not very different in terms of
leaf page accesses. Why should we win by this much, for what seems
like a marginal case for skip scan?
Even cases where "skipping" doesn't manage to skip any leaf pages can
still benefit from skipping *index tuples* -- there is more than one
kind of skipping to consider. That is, the patch helps a lot with some
(though not all) cases where I didn't really expect that to happen:
the Postgres 17 SAOP tuple skipping code (the code in
_bt_checkkeys_look_ahead, and the related code in _bt_readpage) helps
quite a bit in "marginal" skip scan cases, even though it wasn't
really designed for that purpose (it was added to avoid regressions in
SAOP array scans for the Postgres 17 work).
I find that some queries using my original example test case are about
twice as fast as an equivalent full index scan, even when only the
fourth and final index column is used in the query predicate. The scan
can't even skip a single leaf page at a time, and yet we still win by
a nice amount. We win, though it is almost by mistake!
This is mostly a good thing. Both for the obvious reason (fast is
better than slow), and because it justifies being so aggressive in
assuming that skip scan might work out during planning (being wrong
without really losing is nice). But there is also a downside: it makes
it even harder to model runtime costs from within the optimizer.
If I measure actual costs other than elapsed time (e.g.,
buffer accesses), I'm not sure that the optimizer is wrong to think
that the parallel sequential scan is faster. It looks approximately
correct. It is only when we look at runtime that the optimizer's
choice looks wrong. Which is...awkward.
In general, I have very little idea about how to improve the costing
within btcostestimate. I am hoping that somebody has better ideas
about it. btcostestimate is definitely the area where the patch is
weakest right now.
I looked at btcostestimate() to find the reason, and found that the bound
quals and cost.num_sa_scans differ from what I expected.
My assumption was:
* bound quals is id3=XXX (and id1 and id2 are skipped attributes)
* cost.num_sa_scans = 100 (=10*10, assuming 10 primitive index scans
per skipped attribute)
But that's wrong. For the index scan above, the actual values are:
* bound quals is NULL
* cost.num_sa_scans = 1
The logic with cost.num_sa_scans was definitely not what I intended.
That's fixed in v3, at least. But the code in btcostestimate is still
essentially the same as in earlier versions -- it needs to be
completely redesigned (or, uh, designed for the first time).
I know you said the following, but I'd like to know whether the above is expected.
Currently, there is an assumption that "there will be 10 primitive index scans
per skipped attribute". Is there any chance of using pg_stats.n_distinct?
It probably makes sense to use pg_stats.n_distinct here. But how?
If the problem is that we're too pessimistic, then I think that this
will usually (though not always) make us more pessimistic. Isn't that
the wrong direction to go in? (We're probably also too optimistic in
some cases, but being too pessimistic is a bigger problem in
practice.)
For example, your test case involved 11 distinct values in each
column. The current approach of hard-coding 10 (which is just a
temporary hack) should actually make the scan look a bit cheaper than
it would if we used the true ndistinct.
Another underlying problem is that the existing SAOP costing really
isn't very accurate, without skip scan -- that's a big source of the
pessimism with arrays/skipping. Why should we be able to get the true
number of primitive index scans just by multiplying together each
omitted prefix column's ndistinct? That approach is good for getting
the worst case, which is probably relevant -- but it's probably not a
very good assumption for the average case. (Though at least we can cap
the total number of primitive index scans to 1/3 of the total number
of pages in the index in btcostestimate, since we have guarantees
about the worst case as of Postgres 17.)
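The worst-case arithmetic described above can be written down as a small
C sketch. This is not the logic in btcostestimate:
toy_estimate_primitive_scans is a hypothetical helper that multiplies
the per-column ndistinct of each omitted prefix column and then applies
the cap of 1/3 of the index's pages.

```c
#include <assert.h>

/*
 * Hypothetical worst-case estimate of the number of primitive index
 * scans: the product of ndistinct over each skipped prefix column,
 * capped at one third of the pages in the index, and never below 1.
 */
static double
toy_estimate_primitive_scans(const double *ndistinct, int nskipped,
                             double index_pages)
{
    double      scans = 1.0;
    double      cap = index_pages / 3.0;

    assert(index_pages >= 0.0);

    for (int i = 0; i < nskipped; i++)
        scans *= ndistinct[i];

    if (scans > cap)
        scans = cap;
    if (scans < 1.0)
        scans = 1.0;

    return scans;
}
```

For two skipped columns in a large index, the hard-coded value of 10 per
column yields 100 primitive scans, while an ndistinct of 11 per column
yields 121. Switching to the true ndistinct therefore makes this
particular scan look more expensive. On a 150 page index, either input is
capped at 50.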
--
Peter Geoghegan
Attachments:
v3-0001-Add-skip-scan-to-nbtree.patch (application/octet-stream)
From 35672d7b6a8fa7e78341d7f6580474693a6afd7d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 16 Apr 2024 13:21:36 -0400
Subject: [PATCH v3] Add skip scan to nbtree.
Skip scan allows nbtree index scans to efficiently use a composite index
on (a, b) for queries with a predicate such as "WHERE b = 5".
This is useful in cases where the total number of distinct values in the
column 'a' is reasonably small (think hundreds, possibly thousands).
In effect, a skip scan treats the composite index on (a, b) as if it was
a series of disjunct subindexes -- one subindex per distinct 'a' value.
We exhaustively "search every subindex" using a qual that behaves like
"WHERE a = ANY(<every possible 'a' value>) AND b = 5".
The design of skip scan works by extending the design for arrays
established by commit 5bf748b8. "Skip arrays" generate their array
values procedurally and on-demand, but otherwise work just like arrays
used by SAOPs.
B-Tree operator classes on discrete types can now optionally provide a
skip support routine. This is used to generate the next array element
value by incrementing the current value (or by decrementing, in the case
of backwards scans). When the opclass lacks a skip support routine, we
use sentinel next-key values instead. Adding skip support makes skip
scans more efficient in cases where there is naturally a good chance
that the very next value will find matching tuples. For example, during
an index scan with a leading "sales_date" attribute, there is a decent
chance that a scan that just finished returning tuples matching
"sales_date = '2024-06-01' and id = 5000" will find later tuples
matching "sales_date = '2024-06-02' and id = 5000". It is to our
advantage to skip straight to the relevant "id = 5000" leaf page,
totally avoiding reading earlier "sales_date = '2024-06-02'" leaf pages.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com>
Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
---
src/include/access/nbtree.h | 24 +-
src/include/catalog/pg_amproc.dat | 16 +
src/include/catalog/pg_proc.dat | 24 +
src/include/utils/skipsupport.h | 124 ++
src/backend/access/nbtree/nbtcompare.c | 201 +++
src/backend/access/nbtree/nbtree.c | 10 +-
src/backend/access/nbtree/nbtsearch.c | 130 +-
src/backend/access/nbtree/nbtutils.c | 1666 +++++++++++++++++--
src/backend/access/nbtree/nbtvalidate.c | 4 +
src/backend/commands/opclasscmds.c | 25 +
src/backend/utils/adt/Makefile | 1 +
src/backend/utils/adt/date.c | 34 +
src/backend/utils/adt/meson.build | 1 +
src/backend/utils/adt/selfuncs.c | 30 +-
src/backend/utils/adt/skipsupport.c | 54 +
src/backend/utils/adt/uuid.c | 65 +
src/backend/utils/misc/guc_tables.c | 23 +
doc/src/sgml/btree.sgml | 13 +
doc/src/sgml/xindex.sgml | 16 +-
src/test/regress/expected/alter_generic.out | 6 +-
src/test/regress/expected/psql.out | 3 +-
src/test/regress/sql/alter_generic.sql | 2 +-
src/tools/pgindent/typedefs.list | 3 +
23 files changed, 2293 insertions(+), 182 deletions(-)
create mode 100644 src/include/utils/skipsupport.h
create mode 100644 src/backend/utils/adt/skipsupport.c
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 749304334..7cd5902cf 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -24,6 +24,7 @@
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/shm_toc.h"
+#include "utils/skipsupport.h"
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
typedef uint16 BTCycleId;
@@ -709,7 +710,8 @@ BTreeTupleGetMaxHeapTID(IndexTuple itup)
#define BTINRANGE_PROC 3
#define BTEQUALIMAGE_PROC 4
#define BTOPTIONS_PROC 5
-#define BTNProcs 5
+#define BTSKIPSUPPORT_PROC 6
+#define BTNProcs 6
/*
* We need to be able to tell the difference between read and write
@@ -1032,9 +1034,18 @@ typedef BTScanPosData *BTScanPos;
typedef struct BTArrayKeyInfo
{
int scan_key; /* index of associated key in keyData */
+ int num_elems; /* number of elems (-1 for skip array) */
+
+ /* State used by standard arrays that store elements in memory */
int cur_elem; /* index of current element in elem_values */
- int num_elems; /* number of elems in current array value */
Datum *elem_values; /* array of num_elems Datums */
+
+ /* State used by skip arrays, which generate elements procedurally */
+ bool use_sksup; /* sksup set to valid routine? */
+ bool null_elem; /* lowest/highest element actually NULL? */
+ SkipSupportData sksup; /* opclass skip scan support, when use_sksup */
+ ScanKey low_compare; /* !use_sksup > and >= lower bound */
+ ScanKey high_compare; /* !use_sksup < and <= upper bound */
} BTArrayKeyInfo;
typedef struct BTScanOpaqueData
@@ -1123,6 +1134,11 @@ typedef struct BTReadPageState
*/
#define SK_BT_REQFWD 0x00010000 /* required to continue forward scan */
#define SK_BT_REQBKWD 0x00020000 /* required to continue backward scan */
+#define SK_BT_SKIP 0x00040000 /* SK_SEARCHARRAY skip scan key */
+#define SK_BT_NEG_INF 0x00080000 /* -inf search SK_SEARCHARRAY */
+#define SK_BT_POS_INF 0x00100000 /* +inf search SK_SEARCHARRAY */
+#define SK_BT_NEXTKEY 0x00200000 /* key after sk_argument */
+#define SK_BT_PREVKEY 0x00400000 /* key before sk_argument */
#define SK_BT_INDOPTION_SHIFT 24 /* must clear the above bits */
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
@@ -1159,6 +1175,10 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+/* GUC parameters (just a temporary convenience for reviewers) */
+extern PGDLLIMPORT int skipscan_prefix_cols;
+extern PGDLLIMPORT bool skipscan_skipsupport_enabled;
+
/*
* external entry points for btree, in nbtree.c
*/
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index f639c3a6a..2a8f6f3f1 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -21,6 +21,8 @@
amprocrighttype => 'bit', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
amprocrighttype => 'bool', amprocnum => '1', amproc => 'btboolcmp' },
+{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
+ amprocrighttype => 'bool', amprocnum => '6', amproc => 'btboolskipsupport' },
{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
amprocrighttype => 'bool', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/bpchar_ops', amproclefttype => 'bpchar',
@@ -41,12 +43,16 @@
amprocrighttype => 'char', amprocnum => '1', amproc => 'btcharcmp' },
{ amprocfamily => 'btree/char_ops', amproclefttype => 'char',
amprocrighttype => 'char', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/char_ops', amproclefttype => 'char',
+ amprocrighttype => 'char', amprocnum => '6', amproc => 'btcharskipsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '1', amproc => 'date_cmp' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '2', amproc => 'date_sortsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
+ amprocrighttype => 'date', amprocnum => '6', amproc => 'date_skipsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'timestamp', amprocnum => '1',
amproc => 'date_cmp_timestamp' },
@@ -122,6 +128,8 @@
amprocrighttype => 'int2', amprocnum => '2', amproc => 'btint2sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
+ amprocrighttype => 'int2', amprocnum => '6', amproc => 'btint2skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint24cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
@@ -141,6 +149,8 @@
amprocrighttype => 'int4', amprocnum => '2', amproc => 'btint4sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
+ amprocrighttype => 'int4', amprocnum => '6', amproc => 'btint4skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
amprocrighttype => 'int8', amprocnum => '1', amproc => 'btint48cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
@@ -160,6 +170,8 @@
amprocrighttype => 'int8', amprocnum => '2', amproc => 'btint8sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
amprocrighttype => 'int8', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
+ amprocrighttype => 'int8', amprocnum => '6', amproc => 'btint8skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint84cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
@@ -193,6 +205,8 @@
amprocrighttype => 'oid', amprocnum => '2', amproc => 'btoidsortsupport' },
{ amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
amprocrighttype => 'oid', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
+ amprocrighttype => 'oid', amprocnum => '6', amproc => 'btoidskipsupport' },
{ amprocfamily => 'btree/oidvector_ops', amproclefttype => 'oidvector',
amprocrighttype => 'oidvector', amprocnum => '1',
amproc => 'btoidvectorcmp' },
@@ -261,6 +275,8 @@
amprocrighttype => 'uuid', amprocnum => '2', amproc => 'uuid_sortsupport' },
{ amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
amprocrighttype => 'uuid', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
+ amprocrighttype => 'uuid', amprocnum => '6', amproc => 'uuid_skipsupport' },
{ amprocfamily => 'btree/record_ops', amproclefttype => 'record',
amprocrighttype => 'record', amprocnum => '1', amproc => 'btrecordcmp' },
{ amprocfamily => 'btree/record_image_ops', amproclefttype => 'record',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 73d9cf858..27921e0df 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1004,18 +1004,27 @@
{ oid => '3129', descr => 'sort support',
proname => 'btint2sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint2sortsupport' },
+{ oid => '9290', descr => 'skip support',
+ proname => 'btint2skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint2skipsupport' },
{ oid => '351', descr => 'less-equal-greater',
proname => 'btint4cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'int4 int4', prosrc => 'btint4cmp' },
{ oid => '3130', descr => 'sort support',
proname => 'btint4sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint4sortsupport' },
+{ oid => '9291', descr => 'skip support',
+ proname => 'btint4skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint4skipsupport' },
{ oid => '842', descr => 'less-equal-greater',
proname => 'btint8cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'int8 int8', prosrc => 'btint8cmp' },
{ oid => '3131', descr => 'sort support',
proname => 'btint8sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint8sortsupport' },
+{ oid => '9292', descr => 'skip support',
+ proname => 'btint8skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint8skipsupport' },
{ oid => '354', descr => 'less-equal-greater',
proname => 'btfloat4cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'float4 float4', prosrc => 'btfloat4cmp' },
@@ -1034,12 +1043,18 @@
{ oid => '3134', descr => 'sort support',
proname => 'btoidsortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btoidsortsupport' },
+{ oid => '9293', descr => 'skip support',
+ proname => 'btoidskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btoidskipsupport' },
{ oid => '404', descr => 'less-equal-greater',
proname => 'btoidvectorcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'oidvector oidvector', prosrc => 'btoidvectorcmp' },
{ oid => '358', descr => 'less-equal-greater',
proname => 'btcharcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'char char', prosrc => 'btcharcmp' },
+{ oid => '9294', descr => 'skip support',
+ proname => 'btcharskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btcharskipsupport' },
{ oid => '359', descr => 'less-equal-greater',
proname => 'btnamecmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'name name', prosrc => 'btnamecmp' },
@@ -2214,6 +2229,9 @@
{ oid => '3136', descr => 'sort support',
proname => 'date_sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'date_sortsupport' },
+{ oid => '9295', descr => 'skip support',
+ proname => 'date_skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'date_skipsupport' },
{ oid => '4133', descr => 'window RANGE support',
proname => 'in_range', prorettype => 'bool',
proargtypes => 'date date interval bool bool',
@@ -4368,6 +4386,9 @@
{ oid => '1693', descr => 'less-equal-greater',
proname => 'btboolcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'bool bool', prosrc => 'btboolcmp' },
+{ oid => '9296', descr => 'skip support',
+ proname => 'btboolskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btboolskipsupport' },
{ oid => '1688', descr => 'hash',
proname => 'time_hash', prorettype => 'int4', proargtypes => 'time',
@@ -9192,6 +9213,9 @@
{ oid => '3300', descr => 'sort support',
proname => 'uuid_sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'uuid_sortsupport' },
+{ oid => '9297', descr => 'skip support',
+ proname => 'uuid_skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'uuid_skipsupport' },
{ oid => '2961', descr => 'I/O',
proname => 'uuid_recv', prorettype => 'uuid', proargtypes => 'internal',
prosrc => 'uuid_recv' },
diff --git a/src/include/utils/skipsupport.h b/src/include/utils/skipsupport.h
new file mode 100644
index 000000000..ab79acb8c
--- /dev/null
+++ b/src/include/utils/skipsupport.h
@@ -0,0 +1,124 @@
+/*-------------------------------------------------------------------------
+ *
+ * skipsupport.h
+ * Support routines for B-Tree skip scans.
+ *
+ * B-Tree operator classes for discrete types can optionally provide a support
+ * function for skipping. This is used during skip scans.
+ *
+ * A B-tree operator class that implements skip support provides B-tree index
+ * scans with a way of enumerating and iterating through every possible value
+ * from the domain of indexable values. This gives scans a way to determine
+ * the next value in line for a given skip array/scan key/skipped attribute.
+ * This happens at the point where the scan determines that another primitive
+ * index scan is required. The next value is used (in combination with at
+ * least one additional lower-order non-skip key, taken from the SQL query) to
+ * relocate the scan, skipping over many irrelevant leaf pages in the process.
+ *
+ * There are many data types/opclasses where implementing a skip support
+ * scheme is inherently impossible (or at least impractical). Obviously, it
+ * would be wrong if the "next" value generated by an opclass was actually
+ * after the true next value (any index tuples with the true next value would
+ * be overlooked by the index scan).
+ *
+ * Skip scan generally works best with discrete types such as integer, date,
+ * and boolean: types where there is a decent chance that indexes will contain
+ * contiguous values (in respect of the leading/skipped index attribute).
+ * When gaps/discontinuities are naturally rare (e.g., a leading identity
+ * column in a composite index, a date column preceding a product_id column),
+ * then it makes sense for the skip scan to optimistically assume that the
+ * next distinct indexable value will find directly matching index tuples.
+ * The B-Tree code can fall back on next-key probes for any opclass that
+ * doesn't include a skip support function, but it's a good idea to provide
+ * skip support for types that are likely to see benefits.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/skipsupport.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SKIPSUPPORT_H
+#define SKIPSUPPORT_H
+
+#include "utils/relcache.h"
+
+typedef struct SkipSupportData *SkipSupport;
+
+/*
+ * State/callbacks used by skip arrays to procedurally generate elements.
+ *
+ * A BTSKIPSUPPORT_PROC function must set each and every field when called.
+ * If an opclass can only set some of the fields, then it cannot safely
+ * provide a skip support routine (and so must rely on the fallback strategy
+ * used by continuous types, such as numeric).
+ */
+typedef struct SkipSupportData
+{
+ /*
+ * low_elem and high_elem must be set with the lowest and highest possible
+ * values from the domain of indexable values (assuming standard ascending
+ * order). This helps the B-Tree code with finding its initial position
+ * at the leaf level (during the skip scan's first primitive index scan).
+ * In other words, it gives the B-Tree code a useful value to start from,
+ * before any data has been read from the index.
+ *
+ * low_elem and high_elem can also be used to prove that a qual is
+ * unsatisfiable in certain cross-type scenarios.
+ *
+ * low_elem and high_elem are also used by skip scans to determine when
+ * they've reached the final possible value (in the current direction).
+ * It's typical for the scan to run out of leaf pages before it runs out
+ * of unscanned indexable values, but it's still useful for the scan to
+ * have a way to recognize when it has reached the last possible value
+ * (this saves us a useless probe that just lands on the final leaf page).
+ *
+ * Note: the logic for determining that the scan has reached the final
+ * possible value naturally belongs in the B-Tree code. The final value
+ * isn't necessarily the original high_elem/low_elem set by the opclass.
+ * In particular, it'll be a lower/higher value when B-Tree preprocessing
+ * determines that the true range of possible values should be restricted,
+ * due to the presence of an inequality applied to the index's skipped
+ * attribute. These are range skip scans.
+ */
+ Datum low_elem; /* lowest sorting/leftmost non-NULL value */
+ Datum high_elem; /* highest sorting/rightmost non-NULL value */
+
+ /*
+ * Decrement/increment functions.
+ *
+ * Returns a decremented/incremented copy of caller's existing datum,
+ * allocated in caller's memory context (in the case of pass-by-reference
+ * types). It's not okay for these functions to leak any memory.
+ *
+ * Both decrement and increment callbacks are guaranteed to never be
+ * called with a NULL "existing" arg. (In general it is the B-Tree code's
+ * job to worry about NULLs, and about whether indexed values are stored
+ * in ASC order or DESC order.)
+ *
+ * The decrement callback is guaranteed to only be called with an
+ * "existing" value that's strictly > the low_elem set by the opclass.
+ * Similarly, the increment callback is guaranteed to only be called with
+ * an "existing" value that's strictly < the high_elem set by the opclass.
+ * Consequently, opclasses don't have to deal with "overflow" themselves
+ * (though asserting that the B-Tree code got it right is a good idea).
+ *
+ * It's quite possible (and very common) for the B-Tree skip scan caller's
+ * "existing" datum to just be a straight copy of a value that it copied
+ * from the index. Operator classes must be liberal in accepting every
+ * possible representational variation within the underlying data type.
+ * Opclasses don't have to preserve whatever semantically insignificant
+ * information the data type might be carrying around, though.
+ *
+ * Note: < and > are defined by the opclass's ORDER proc in the usual way.
+ */
+ Datum (*decrement) (Relation rel, Datum existing);
+ Datum (*increment) (Relation rel, Datum existing);
+} SkipSupportData;
+
+extern bool PrepareSkipSupportFromOpclass(Oid opfamily, Oid opcintype,
+ bool reverse, SkipSupport sksup);
+
+#endif /* SKIPSUPPORT_H */
diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c
index 1c72867c8..48a877613 100644
--- a/src/backend/access/nbtree/nbtcompare.c
+++ b/src/backend/access/nbtree/nbtcompare.c
@@ -58,6 +58,7 @@
#include <limits.h>
#include "utils/fmgrprotos.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
#ifdef STRESS_SORT_INT_MIN
@@ -78,6 +79,39 @@ btboolcmp(PG_FUNCTION_ARGS)
PG_RETURN_INT32((int32) a - (int32) b);
}
+static Datum
+bool_decrement(Relation rel, Datum existing)
+{
+ bool bexisting = DatumGetBool(existing);
+
+ Assert(bexisting == true);
+
+ return BoolGetDatum(bexisting - 1);
+}
+
+static Datum
+bool_increment(Relation rel, Datum existing)
+{
+ bool bexisting = DatumGetBool(existing);
+
+ Assert(bexisting == false);
+
+ return BoolGetDatum(bexisting + 1);
+}
+
+Datum
+btboolskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = bool_decrement;
+ sksup->increment = bool_increment;
+ sksup->low_elem = BoolGetDatum(false);
+ sksup->high_elem = BoolGetDatum(true);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint2cmp(PG_FUNCTION_ARGS)
{
@@ -105,6 +139,39 @@ btint2sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int2_decrement(Relation rel, Datum existing)
+{
+ int16 iexisting = DatumGetInt16(existing);
+
+ Assert(iexisting > PG_INT16_MIN);
+
+ return Int16GetDatum(iexisting - 1);
+}
+
+static Datum
+int2_increment(Relation rel, Datum existing)
+{
+ int16 iexisting = DatumGetInt16(existing);
+
+ Assert(iexisting < PG_INT16_MAX);
+
+ return Int16GetDatum(iexisting + 1);
+}
+
+Datum
+btint2skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int2_decrement;
+ sksup->increment = int2_increment;
+ sksup->low_elem = Int16GetDatum(PG_INT16_MIN);
+ sksup->high_elem = Int16GetDatum(PG_INT16_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint4cmp(PG_FUNCTION_ARGS)
{
@@ -128,6 +195,39 @@ btint4sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int4_decrement(Relation rel, Datum existing)
+{
+ int32 iexisting = DatumGetInt32(existing);
+
+ Assert(iexisting > PG_INT32_MIN);
+
+ return Int32GetDatum(iexisting - 1);
+}
+
+static Datum
+int4_increment(Relation rel, Datum existing)
+{
+ int32 iexisting = DatumGetInt32(existing);
+
+ Assert(iexisting < PG_INT32_MAX);
+
+ return Int32GetDatum(iexisting + 1);
+}
+
+Datum
+btint4skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int4_decrement;
+ sksup->increment = int4_increment;
+ sksup->low_elem = Int32GetDatum(PG_INT32_MIN);
+ sksup->high_elem = Int32GetDatum(PG_INT32_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint8cmp(PG_FUNCTION_ARGS)
{
@@ -171,6 +271,39 @@ btint8sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int8_decrement(Relation rel, Datum existing)
+{
+ int64 iexisting = DatumGetInt64(existing);
+
+ Assert(iexisting > PG_INT64_MIN);
+
+ return Int64GetDatum(iexisting - 1);
+}
+
+static Datum
+int8_increment(Relation rel, Datum existing)
+{
+ int64 iexisting = DatumGetInt64(existing);
+
+ Assert(iexisting < PG_INT64_MAX);
+
+ return Int64GetDatum(iexisting + 1);
+}
+
+Datum
+btint8skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int8_decrement;
+ sksup->increment = int8_increment;
+ sksup->low_elem = Int64GetDatum(PG_INT64_MIN);
+ sksup->high_elem = Int64GetDatum(PG_INT64_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint48cmp(PG_FUNCTION_ARGS)
{
@@ -292,6 +425,39 @@ btoidsortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+oid_decrement(Relation rel, Datum existing)
+{
+ Oid oexisting = DatumGetObjectId(existing);
+
+ Assert(oexisting > InvalidOid);
+
+ return ObjectIdGetDatum(oexisting - 1);
+}
+
+static Datum
+oid_increment(Relation rel, Datum existing)
+{
+ Oid oexisting = DatumGetObjectId(existing);
+
+ Assert(oexisting < OID_MAX);
+
+ return ObjectIdGetDatum(oexisting + 1);
+}
+
+Datum
+btoidskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = oid_decrement;
+ sksup->increment = oid_increment;
+ sksup->low_elem = ObjectIdGetDatum(InvalidOid);
+ sksup->high_elem = ObjectIdGetDatum(OID_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btoidvectorcmp(PG_FUNCTION_ARGS)
{
@@ -325,3 +491,38 @@ btcharcmp(PG_FUNCTION_ARGS)
/* Be careful to compare chars as unsigned */
PG_RETURN_INT32((int32) ((uint8) a) - (int32) ((uint8) b));
}
+
+static Datum
+char_decrement(Relation rel, Datum existing)
+{
+ uint8 cexisting = DatumGetUInt8(existing);
+
+ Assert(cexisting > 0);
+
+ return CharGetDatum((uint8) cexisting - 1);
+}
+
+static Datum
+char_increment(Relation rel, Datum existing)
+{
+ uint8 cexisting = DatumGetUInt8(existing);
+
+ Assert(cexisting < UCHAR_MAX);
+
+ return CharGetDatum((uint8) cexisting + 1);
+}
+
+Datum
+btcharskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = char_decrement;
+ sksup->increment = char_increment;
+
+ /* btcharcmp compares chars as unsigned */
+ sksup->low_elem = UInt8GetDatum(0);
+ sksup->high_elem = UInt8GetDatum(UCHAR_MAX);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 686a3206f..9c9cd48f7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -324,11 +324,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
BTScanPosInvalidate(so->currPos);
BTScanPosInvalidate(so->markPos);
- if (scan->numberOfKeys > 0)
- so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
- else
- so->keyData = NULL;
+ so->keyData = NULL;
so->needPrimScan = false;
so->scanBehind = false;
so->arrayKeys = NULL;
@@ -408,6 +405,11 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
scan->numberOfKeys * sizeof(ScanKeyData));
so->numberOfKeys = 0; /* until _bt_preprocess_keys sets it */
so->numArrayKeys = 0; /* ditto */
+
+ /* Release private storage allocated in previous btrescan, if any */
+ if (so->keyData != NULL)
+ pfree(so->keyData);
+ so->keyData = NULL;
}
/*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 57bcfc7e4..f1bb4e8ee 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -880,7 +880,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Buffer buf;
BTStack stack;
OffsetNumber offnum;
- StrategyNumber strat;
BTScanInsertData inskey;
ScanKey startKeys[INDEX_MAX_KEYS];
ScanKeyData notnullkeys[INDEX_MAX_KEYS];
@@ -1022,6 +1021,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
ScanKey chosen;
ScanKey impliesNN;
ScanKey cur;
+ int ikey = 0,
+ ichosen = 0;
/*
* chosen is the so-far-chosen key for the current attribute, if any.
@@ -1042,6 +1043,96 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
{
if (i >= so->numberOfKeys || cur->sk_attno != curattr)
{
+ /*
+ * Conceptually, skip arrays consist of array elements whose
+ * values are generated procedurally and on demand. We need
+ * special handling for that here.
+ *
+ * We must interpret various sentinel values to generate an
+ * insertion scan key. This is only actually needed for index
+ * attributes whose input opclass lacks a skip support routine
+ * (when skip support is available we'll always be able to
+ * generate true array element datum values instead).
+ */
+ if (chosen && chosen->sk_flags & (SK_BT_NEG_INF | SK_BT_POS_INF))
+ {
+ BTArrayKeyInfo *array = NULL;
+
+ Assert(chosen->sk_flags & SK_BT_SKIP);
+ Assert(!(chosen->sk_flags & (SK_BT_NEXTKEY | SK_BT_PREVKEY)));
+
+ for (; ikey < so->numArrayKeys; ikey++)
+ {
+ array = &so->arrayKeys[ikey];
+ if (array->scan_key == ichosen)
+ break;
+ }
+
+ Assert(array->scan_key == ichosen);
+ Assert(array->num_elems == -1);
+ Assert(!array->use_sksup);
+
+ if (array->null_elem)
+ {
+ /*
+ * Treat the chosen scan key as having the value -inf
+ * (or the value +inf, in the backwards scan case) by
+ * not appending it to the local startKeys[] array.
+ *
+ * Note: we expect one or more lower-order required
+ * keys that won't influence initial positioning (for
+ * this primitive index scan). There cannot possibly
+ * be non-pivot tuples that have values matching -inf,
+ * though, so this "omission" can have no real impact.
+ *
+ * Note: This array has a NULL element, which means
+ * that there must be no upper/lower inequalities.
+ * Assert that preprocessing got this right.
+ */
+ Assert(!array->low_compare);
+ Assert(!array->high_compare);
+ break; /* done adding entries to startKeys[] */
+ }
+ else if ((chosen->sk_flags & SK_BT_NEG_INF) &&
+ array->low_compare)
+ {
+ Assert(ScanDirectionIsForward(dir));
+
+ /* use array's inequality key in startKeys[] */
+ chosen = array->low_compare;
+ }
+ else if ((chosen->sk_flags & SK_BT_POS_INF) &&
+ array->high_compare)
+ {
+ Assert(ScanDirectionIsBackward(dir));
+
+ /* use array's inequality key in startKeys[] */
+ chosen = array->high_compare;
+ }
+ else
+ {
+ /*
+ * Array starts at (or ends just before) any non-NULL
+ * values. Deduce a NOT NULL key to skip over NULLs.
+ *
+ * Note: range skip arrays generated using an explicit
+ * IS NOT NULL input scan key against an otherwise
+ * omitted prefix attribute use this path, too.
+ */
+ impliesNN = chosen;
+ chosen = NULL;
+ }
+
+ /*
+ * We'll add the chosen inequality (or a deduced NOT NULL
+ * key) to startKeys[] below.
+ *
+ * Note: we usually won't be able to add any additional
+ * scan keys for index attributes beyond this one. This
+ * is okay for the same reason as the -inf/+inf case.
+ */
+ }
+
/*
* Done looking at keys for curattr. If we didn't find a
* usable boundary key, see if we can deduce a NOT NULL key.
@@ -1075,16 +1166,41 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
break;
startKeys[keysz++] = chosen;
+ /*
+ * Skip arrays can also use a sk_argument which is marked
+ * "next key". This is another sentinel array element value
+ * requiring special handling here by us. As with -inf/+inf
+ * sentinels, there cannot be any exact non-pivot matches.
+ */
+ if (chosen->sk_flags & (SK_BT_NEXTKEY | SK_BT_PREVKEY))
+ {
+ Assert(chosen->sk_flags & SK_BT_SKIP);
+ Assert(!(chosen->sk_flags & (SK_BT_NEG_INF | SK_BT_POS_INF)));
+ Assert(chosen->sk_strategy == BTEqualStrategyNumber);
+
+ /*
+ * Adjust strat_total, so that our = key gets treated like
+ * a > key (or like a < key).
+ *
+ * The key is still conceptually a = key; we only do this
+ * because there's no explicit next/prev key we can use.
+ */
+ if (chosen->sk_flags & SK_BT_NEXTKEY)
+ strat_total = BTGreaterStrategyNumber;
+ else
+ strat_total = BTLessStrategyNumber;
+ break;
+ }
+
/*
* Adjust strat_total, and quit if we have stored a > or <
* key.
*/
- strat = chosen->sk_strategy;
- if (strat != BTEqualStrategyNumber)
+ if (chosen->sk_strategy != BTEqualStrategyNumber)
{
- strat_total = strat;
- if (strat == BTGreaterStrategyNumber ||
- strat == BTLessStrategyNumber)
+ strat_total = chosen->sk_strategy;
+ if (chosen->sk_strategy == BTGreaterStrategyNumber ||
+ chosen->sk_strategy == BTLessStrategyNumber)
break;
}
@@ -1103,6 +1219,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
curattr = cur->sk_attno;
chosen = NULL;
impliesNN = NULL;
+ ichosen = -1;
}
/*
@@ -1127,6 +1244,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
case BTEqualStrategyNumber:
/* override any non-equality choice */
chosen = cur;
+ ichosen = i;
break;
case BTGreaterEqualStrategyNumber:
case BTGreaterStrategyNumber:
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index d6de2072d..133cb4687 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -29,9 +29,50 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+/*
+ * GUC parameters (temporary convenience for reviewers).
+ *
+ * To disable all skipping, set skipscan_prefix_cols=0. Otherwise set it to
+ * the attribute number that you wish to make the last attribute number that
+ * we can add a skip scan key for.
+ *
+ * For example, setting skipscan_prefix_cols=1 before an index scan with qual
+ * "WHERE b = 1 AND c > 42" will make us generate a skip scan key on the
+ * column 'a' (which is attnum 1) only, preventing us from adding one for the
+ * column 'c' (and so 'c' will still have an inequality scan key, required in
+ * only one direction -- 'c' won't be output as a "range" skip key/array).
+ *
+ * The same scan keys will be output when skipscan_prefix_cols=2, given the
+ * same query/qual, since we naturally get a required equality scan key on 'b'
+ * from the input scan keys (provided we at least manage to add a skip scan
+ * key on 'a' that "anchors its required-ness" to the 'b' scan key.)
+ *
+ * When skipscan_prefix_cols is set to the number of key columns in the index,
+ * we're as aggressive as possible about adding skip scan arrays/scan keys.
+ * This is the current default behavior, and the behavior we're targeting for
+ * the committed patch (if there are slowdowns from being maximally aggressive
+ * here then the likely solution is to make _bt_advance_array_keys adaptive,
+ * rather than trying to predict what will work during preprocessing).
+ */
+int skipscan_prefix_cols = INDEX_MAX_KEYS;
+
+/*
+ * skipscan_skipsupport_enabled can be used to avoid using opclass skip
+ * support routines. This can be used to quantify the performance benefit that
+ * comes from having dedicated skip support, with a given test query.
+ */
+bool skipscan_skipsupport_enabled = true;
+
#define LOOK_AHEAD_REQUIRED_RECHECKS 3
#define LOOK_AHEAD_DEFAULT_DISTANCE 5
+typedef struct BTSkipPreproc
+{
+ SkipSupportData sksup; /* opclass skip scan support */
+ bool use_sksup; /* sksup set to valid routine? */
+ Oid eq_op; /* InvalidOid means don't skip */
+} BTSkipPreproc;
+
typedef struct BTSortArrayContext
{
FmgrInfo *sortproc;
@@ -62,17 +103,48 @@ static bool _bt_compare_array_scankey_args(IndexScanDesc scan,
ScanKey arraysk, ScanKey skey,
FmgrInfo *orderproc, BTArrayKeyInfo *array,
bool *qual_ok);
-static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan);
+static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys);
static void _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap);
+static int _bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts);
+static bool _bt_skip_support(Relation rel, int add_skip_attno,
+ BTSkipPreproc *skipatts);
+static inline Datum _bt_apply_decrement(Relation rel, ScanKey skey,
+ BTArrayKeyInfo *array);
+static inline Datum _bt_apply_increment(Relation rel, ScanKey skey,
+ BTArrayKeyInfo *array);
static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
static inline int32 _bt_compare_array_skey(FmgrInfo *orderproc,
Datum tupdatum, bool tupnull,
- Datum arrdatum, ScanKey cur);
+ Datum arrdatum, bool arrnull,
+ ScanKey cur);
+static void _bt_apply_compare_array(ScanKey arraysk, ScanKey skey,
+ FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok);
+static void _bt_apply_compare_skiparray(IndexScanDesc scan, ScanKey arraysk,
+ ScanKey skey, FmgrInfo *orderproc,
+ FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok);
static int _bt_binsrch_array_skey(FmgrInfo *orderproc,
bool cur_elem_trig, ScanDirection dir,
Datum tupdatum, bool tupnull,
BTArrayKeyInfo *array, ScanKey cur,
int32 *set_elem_result);
+static void _bt_binsrch_skiparray_skey(FmgrInfo *orderproc,
+ bool cur_elem_trig, ScanDirection dir,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *set_elem_result);
+static void _bt_scankey_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static void _bt_scankey_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static void _bt_scankey_set_low_or_high(Relation rel, ScanKey skey,
+ BTArrayKeyInfo *array, bool low_not_high);
+static bool _bt_scankey_skip_increment(Relation rel, ScanDirection dir,
+ BTArrayKeyInfo *array, ScanKey skey,
+ FmgrInfo *orderproc);
+static void _bt_scankey_set_element(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ Datum tupdatum, bool tupnull);
+static void _bt_scankey_unset_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static void _bt_scankey_set_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir);
static void _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir);
static bool _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
@@ -251,9 +323,6 @@ _bt_freestack(BTStack stack)
* It is convenient for _bt_preprocess_keys caller to have to deal with no
* more than one equality strategy array scan key per index attribute. We'll
* always be able to set things up that way when complete opfamilies are used.
- * Eliminated array scan keys can be recognized as those that have had their
- * sk_strategy field set to InvalidStrategy here by us. Caller should avoid
- * including these in the scan's so->keyData[] output array.
*
* We set the scan key references from the scan's BTArrayKeyInfo info array to
* offsets into the temp modified input array returned to caller. Scans that
@@ -261,18 +330,36 @@ _bt_freestack(BTStack stack)
* preprocessing steps are complete. This will convert the scan key offset
* references into references to the scan's so->keyData[] output scan keys.
*
+ * We're also responsible for generating skip arrays (and their associated
+ * scan keys) here. This enables skip scan. We do this for index attributes
+ * that initially lacked an equality condition within scan->keyData[], iff
+ * doing so allows a later scan key (that was passed to us in scan->keyData[])
+ * to be marked required by later preprocessing on output.
+ * _bt_decide_skipatts decides which attributes receive skip arrays.
+ *
+ * Caller must pass *numberOfKeys to give us a way to change the number of
+ * input scan keys (our output is caller's input). The returned array can be
+ * smaller than scan->keyData[] when we eliminated a redundant array scan key
+ * (redundant with some other array scan key, for the same attribute). It can
+ * also be larger when we added a skip array/skip scan key. Caller uses this
+ * to allocate so->keyData[] for the current btrescan.
+ *
* Note: the reason we need to return a temp scan key array, rather than just
* scribbling on scan->keyData, is that callers are permitted to call btrescan
* without supplying a new set of scankey data.
*/
static ScanKey
-_bt_preprocess_array_keys(IndexScanDesc scan)
+_bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
- int numberOfKeys = scan->numberOfKeys;
+ int numArrayKeyData = scan->numberOfKeys;
int16 *indoption = rel->rd_indoption;
- int numArrayKeys;
+ BTSkipPreproc skipatts[INDEX_MAX_KEYS];
+ int numArrayKeys,
+ numSkipArrayKeys,
+ output_ikey = 0;
+ AttrNumber attno_skip = 1;
int origarrayatt = InvalidAttrNumber,
origarraykey = -1;
Oid origelemtype = InvalidOid;
@@ -280,11 +367,14 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
MemoryContext oldContext;
ScanKey arrayKeyData; /* modified copy of scan->keyData */
- Assert(numberOfKeys);
+ Assert(scan->numberOfKeys);
- /* Quick check to see if there are any array keys */
+ /*
+ * Quick check to see if there are any array keys, or any missing keys we
+ * can generate a "skip scan" array key for ourselves
+ */
numArrayKeys = 0;
- for (int i = 0; i < numberOfKeys; i++)
+ for (int i = 0; i < scan->numberOfKeys; i++)
{
cur = &scan->keyData[i];
if (cur->sk_flags & SK_SEARCHARRAY)
@@ -300,6 +390,16 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
}
}
+ /* Consider generating skip arrays, and associated equality scan keys */
+ numSkipArrayKeys = _bt_decide_skipatts(scan, skipatts);
+ if (numSkipArrayKeys)
+ {
+ /* At least one skip array scan key must be added to arrayKeyData[] */
+ numArrayKeys += numSkipArrayKeys;
+ /* output scan key buffer allocation needs space for skip scan keys */
+ numArrayKeyData += numSkipArrayKeys;
+ }
+
/* Quit if nothing to do. */
if (numArrayKeys == 0)
return NULL;
@@ -317,19 +417,23 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
oldContext = MemoryContextSwitchTo(so->arrayContext);
- /* Create modifiable copy of scan->keyData in the workspace context */
- arrayKeyData = (ScanKey) palloc(numberOfKeys * sizeof(ScanKeyData));
- memcpy(arrayKeyData, scan->keyData, numberOfKeys * sizeof(ScanKeyData));
+ /* Create output scan keys in the workspace context */
+ arrayKeyData = (ScanKey) palloc(numArrayKeyData * sizeof(ScanKeyData));
/* Allocate space for per-array data in the workspace context */
so->arrayKeys = (BTArrayKeyInfo *) palloc(numArrayKeys * sizeof(BTArrayKeyInfo));
/* Allocate space for ORDER procs used to help _bt_checkkeys */
- so->orderProcs = (FmgrInfo *) palloc(numberOfKeys * sizeof(FmgrInfo));
+ so->orderProcs = (FmgrInfo *) palloc(numArrayKeyData * sizeof(FmgrInfo));
- /* Now process each array key */
+ /*
+ * Process each array key, and generate skip arrays as needed. Also copy
+ * every scan->keyData[] input scan key (whether it's an array or not)
+ * into the arrayKeyData array we'll return to our caller (barring any
+ * array scan keys that we could eliminate early through array merging).
+ */
numArrayKeys = 0;
- for (int i = 0; i < numberOfKeys; i++)
+ for (int input_ikey = 0; input_ikey < scan->numberOfKeys; input_ikey++)
{
FmgrInfo sortproc;
FmgrInfo *sortprocp = &sortproc;
@@ -345,14 +449,88 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
int num_nonnulls;
int j;
- cur = &arrayKeyData[i];
- if (!(cur->sk_flags & SK_SEARCHARRAY))
- continue;
+ /* Create a skip array and scan key where indicated by skipatts */
+ while (numSkipArrayKeys &&
+ attno_skip <= scan->keyData[input_ikey].sk_attno)
+ {
+ Oid opcintype = rel->rd_opcintype[attno_skip - 1];
+ Oid collation = rel->rd_indcollation[attno_skip - 1];
+ Oid eq_op = skipatts[attno_skip - 1].eq_op;
+ RegProcedure cmp_proc;
+
+ if (!OidIsValid(eq_op))
+ {
+ /* won't skip using this attribute */
+ attno_skip++;
+ continue;
+ }
+
+ cmp_proc = get_opcode(eq_op);
+ if (!RegProcedureIsValid(cmp_proc))
+ elog(ERROR, "missing oprcode for skipping equals operator %u", eq_op);
+
+ cur = &arrayKeyData[output_ikey];
+ Assert(attno_skip <= scan->keyData[input_ikey].sk_attno);
+ ScanKeyEntryInitialize(cur,
+ SK_SEARCHARRAY | SK_BT_SKIP, /* flags */
+ attno_skip, /* skipped att number */
+ BTEqualStrategyNumber, /* equality strategy */
+ InvalidOid, /* opclass input subtype */
+ collation, /* index column's collation */
+ cmp_proc, /* equality operator's proc */
+ (Datum) 0); /* constant */
+
+ /* Initialize array fields */
+ so->arrayKeys[numArrayKeys].scan_key = output_ikey;
+ so->arrayKeys[numArrayKeys].num_elems = -1;
+ so->arrayKeys[numArrayKeys].cur_elem = 0;
+ so->arrayKeys[numArrayKeys].elem_values = NULL; /* unused */
+ so->arrayKeys[numArrayKeys].use_sksup = skipatts[attno_skip - 1].use_sksup;
+ so->arrayKeys[numArrayKeys].null_elem = true; /* for now */
+ so->arrayKeys[numArrayKeys].sksup = skipatts[attno_skip - 1].sksup;
+ so->arrayKeys[numArrayKeys].low_compare = NULL; /* for now */
+ so->arrayKeys[numArrayKeys].high_compare = NULL; /* for now */
+
+ /*
+ * Temporary testing GUC can disable the use of an opclass's skip
+ * support routine
+ */
+ if (!skipscan_skipsupport_enabled)
+ so->arrayKeys[numArrayKeys].use_sksup = false;
+
+ /*
+ * We'll need a 3-way ORDER proc to determine when and how the
+ * consed-up "array" will advance inside _bt_advance_array_keys.
+ * Set one up now.
+ */
+ _bt_setup_array_cmp(scan, cur, opcintype,
+ &so->orderProcs[output_ikey], NULL);
+
+ /*
+ * Prepare to output next scan key (might be another skip scan
+ * key, or it could be an input scan key from scan->keyData[])
+ */
+ numSkipArrayKeys--;
+ numArrayKeys++;
+ attno_skip++;
+ output_ikey++; /* keep this scan key/array */
+ }
/*
- * First, deconstruct the array into elements. Anything allocated
- * here (including a possibly detoasted array value) is in the
- * workspace context.
+ * Copy input scan key into temp arrayKeyData scan key array. (From
+ * here on, cur points at our copy of the input scan key.)
+ */
+ cur = &arrayKeyData[output_ikey];
+ *cur = scan->keyData[input_ikey];
+
+ if (!(cur->sk_flags & SK_SEARCHARRAY))
+ {
+ output_ikey++; /* keep this non-array scan key */
+ continue;
+ }
+
+ /*
+ * Deconstruct the array into elements
*/
arrayval = DatumGetArrayTypeP(cur->sk_argument);
/* We could cache this data, but not clear it's worth it */
@@ -406,6 +584,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
_bt_find_extreme_element(scan, cur, elemtype,
BTGreaterStrategyNumber,
elem_values, num_nonnulls);
+ output_ikey++; /* keep this transformed scan key */
continue;
case BTEqualStrategyNumber:
/* proceed with rest of loop */
@@ -416,6 +595,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
_bt_find_extreme_element(scan, cur, elemtype,
BTLessStrategyNumber,
elem_values, num_nonnulls);
+ output_ikey++; /* keep this transformed scan key */
continue;
default:
elog(ERROR, "unrecognized StrategyNumber: %d",
@@ -432,7 +612,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
* sortproc just points to the same proc used during binary searches.
*/
_bt_setup_array_cmp(scan, cur, elemtype,
- &so->orderProcs[i], &sortprocp);
+ &so->orderProcs[output_ikey], &sortprocp);
/*
* Sort the non-null elements and eliminate any duplicates. We must
@@ -476,11 +656,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
break;
}
- /*
- * Indicate to _bt_preprocess_keys caller that it must ignore
- * this scan key
- */
- cur->sk_strategy = InvalidStrategy;
+ /* Throw away this array */
continue;
}
@@ -511,12 +687,19 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
* Note: _bt_preprocess_array_keys_final will fix-up each array's
* scan_key field later on, after so->keyData[] has been finalized.
*/
- so->arrayKeys[numArrayKeys].scan_key = i;
+ so->arrayKeys[numArrayKeys].scan_key = output_ikey;
so->arrayKeys[numArrayKeys].num_elems = num_elems;
so->arrayKeys[numArrayKeys].elem_values = elem_values;
+ so->arrayKeys[numArrayKeys].null_elem = false; /* unused */
+ so->arrayKeys[numArrayKeys].use_sksup = false; /* redundant */
+ so->arrayKeys[numArrayKeys].low_compare = NULL; /* unused */
+ so->arrayKeys[numArrayKeys].high_compare = NULL; /* unused */
numArrayKeys++;
+ output_ikey++; /* keep this scan key/array */
}
+ /* Set final number of arrayKeyData[] keys, array keys */
+ *numberOfKeys = output_ikey;
so->numArrayKeys = numArrayKeys;
MemoryContextSwitchTo(oldContext);
@@ -624,7 +807,8 @@ _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap)
{
BTArrayKeyInfo *array = &so->arrayKeys[arrayidx];
- Assert(array->num_elems > 0);
+ Assert(array->num_elems > 0 || array->num_elems == -1);
+ Assert(array->num_elems != -1 || outkey->sk_flags & SK_BT_REQFWD);
if (array->scan_key == input_ikey)
{
@@ -685,6 +869,241 @@ _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap)
so->numArrayKeys, INDEX_MAX_KEYS)));
}
+/*
+ * _bt_decide_skipatts() -- set index attributes requiring skip arrays
+ *
+ * _bt_preprocess_array_keys helper function. Determines which attributes
+ * will require skip arrays/scan keys. Also sets up skip support function for
+ * each of these attributes.
+ *
+ * This sets up "skip scan". Adding skip arrays (and associated scan keys)
+ * allows _bt_preprocess_keys to mark lower-order scan keys (copied from the
+ * original scan->keyData[] array in the conventional way) as required. The
+ * overall effect is to enable skipping over irrelevant sections of the index.
+ *
+ * Return value is the total number of scan keys to add as "input" scan keys
+ * for further processing within _bt_preprocess_keys.
+ */
+static int
+_bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts)
+{
+ Relation rel = scan->indexRelation;
+ ScanKey inputsk;
+ AttrNumber attno_inputsk = 1,
+ attno_skip = 1;
+ bool attno_has_equal = false,
+ attno_has_rowcompare = false;
+ int numSkipArrayKeys = 0,
+ prev_numSkipArrayKeys = 0;
+
+ Assert(scan->numberOfKeys);
+
+ /*
+ * FIXME Parallel scans aren't supported for now. Must add logic to
+ * places like _bt_parallel_primscan_schedule so that we account for skip
+ * arrays when parallel workers serialize their array scan state.
+ */
+ if (scan->parallel_scan)
+ return 0;
+
+ inputsk = &scan->keyData[0];
+ for (int i = 0;; inputsk++, i++)
+ {
+ /*
+ * Backfill skip arrays for any wholly omitted attributes prior to
+ * attno_inputsk
+ */
+ while (attno_skip < attno_inputsk)
+ {
+ if (!_bt_skip_support(rel, attno_skip, &skipatts[attno_skip - 1]))
+ {
+ /*
+ * Opclass lacks a suitable skip support routine.
+ *
+ * Return prev_numSkipArrayKeys, so as to avoid including any
+ * "backfilled" arrays that were supposed to form a contiguous
+ * group with a skip array on this attribute. There is no
+ * benefit to adding backfill skip arrays unless we can do so
+ * for all attributes (all attributes up to and including the
+ * one immediately before attno_inputsk).
+ */
+ return prev_numSkipArrayKeys;
+ }
+
+ /* plan on adding a backfill skip array for this attribute */
+ numSkipArrayKeys++;
+ attno_skip++;
+ }
+
+ /*
+ * Stop once past the final input scan key. We deliberately never add
+ * a skip attribute for the attribute of the last input scan key.
+ *
+ * If the last input scan key(s) use equality strategy, then a skip
+ * attribute is superfluous at best. If the last input scan key uses
+ * an inequality strategy, then adding a skip scan array/scan key is a
+ * valid though suboptimal transformation. It is better to arrange
+ * for preprocessing to allow such an input inequality scan key to
+ * remain an inequality on output. That way _bt_checkkeys will be
+ * able to make best use of both of its precheck optimizations, but
+ * _bt_first will be no less capable of efficiently finding the
+ * starting position for each primitive index scan.
+ */
+ if (i >= scan->numberOfKeys)
+ break;
+
+ /*
+ * Cannot keep adding skip arrays after a RowCompare
+ */
+ if (attno_has_rowcompare)
+ break;
+
+ /*
+ * Apply temporary testing GUC that can be used to disable skipping
+ * (either in part or in whole)
+ */
+ if (attno_inputsk > skipscan_prefix_cols)
+ break;
+
+ /*
+ * Now consider next attno_inputsk (or keep going if this is an
+ * additional scan key against the same attribute)
+ */
+ if (attno_inputsk < inputsk->sk_attno)
+ {
+ prev_numSkipArrayKeys = numSkipArrayKeys;
+
+ /*
+ * Now add skip array for previous scan key's attribute, though
+ * only if the attribute has no equality strategy scan keys.
+ *
+ * Adding skip arrays to an attribute that has one or more
+ * inequality scan keys will cause preprocessing to output a range
+ * skip array. This will happen when preprocessing proper deals
+ * with the redundancy between the array and its inequalities.
+ */
+ skipatts[attno_skip - 1].eq_op = InvalidOid;
+ if (!attno_has_equal)
+ {
+ /* Only saw inequalities for the prior attribute */
+ if (_bt_skip_support(rel, attno_skip,
+ &skipatts[attno_skip - 1]))
+ {
+ /* add a range skip array for this attribute */
+ numSkipArrayKeys++;
+ }
+ else
+ break;
+ }
+ else
+ {
+ /*
+ * Saw an equality for the prior attribute, so it doesn't need
+ * a skip array (not even a range skip array). We'll be able
+ * to add later skip arrays, too (doesn't matter if the prior
+ * attribute uses an input opclass without skip support).
+ */
+ }
+
+ /* Set things up for this new attribute */
+ attno_skip++;
+ attno_inputsk = inputsk->sk_attno;
+ attno_has_equal = false;
+ }
+
+ /*
+ * Track if this scan key's attribute has any equality strategy scan
+ * keys.
+ *
+ * Treat IS NULL scan keys as using equal strategy (they'll be marked
+ * as using it later on, by _bt_fix_scankey_strategy).
+ */
+ if (inputsk->sk_strategy == BTEqualStrategyNumber ||
+ (inputsk->sk_flags & SK_SEARCHNULL))
+ attno_has_equal = true;
+
+ /*
+ * We don't support RowCompare transformation. Remember that we saw a
+ * RowCompare, so that we don't keep adding skip attributes.
+ *
+ * We do still backfill skip attributes before the RowCompare, so that
+ * it can be marked required. This is similar to what happens when a
+ * conventional inequality uses an opclass that lacks skip support.
+ */
+ if (inputsk->sk_flags & SK_ROW_HEADER)
+ attno_has_rowcompare = true;
+ }
+
+ return numSkipArrayKeys;
+}
+
+/*
+ * _bt_skip_support() -- set up skip support function in *skipatts
+ *
+ * Returns true on success, indicating that we set *skipatts with input
+ * opclass's equality operator. Otherwise returns false.
+ */
+static bool
+_bt_skip_support(Relation rel, int add_skip_attno, BTSkipPreproc *skipatts)
+{
+ int16 *indoption = rel->rd_indoption;
+ Oid opfamily = rel->rd_opfamily[add_skip_attno - 1];
+ Oid opcintype = rel->rd_opcintype[add_skip_attno - 1];
+ bool reverse;
+
+ /* Look up input opclass's equality operator */
+ skipatts->eq_op = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ /*
+ * We don't expect input opclasses lacking even an equality operator, but
+ * it's possible. Deal with it gracefully.
+ */
+ if (!OidIsValid(skipatts->eq_op))
+ return false;
+
+ /* Have skip support infrastructure set all SkipSupport fields */
+ reverse = (indoption[add_skip_attno - 1] & INDOPTION_DESC) != 0;
+ skipatts->use_sksup = PrepareSkipSupportFromOpclass(opfamily, opcintype,
+ reverse,
+ &skipatts->sksup);
+
+ /* might not have set up skip support routine, but can skip either way */
+ return true;
+}
+
+/*
+ * _bt_apply_decrement() -- Get a decremented copy of skey's arg
+ *
+ * Note: this wrapper function calls the opclass increment function when the
+ * index stores values in descending order. We're "logically decrementing" to
+ * the previous value in the key space regardless.
+ */
+static inline Datum
+_bt_apply_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ if (!(skey->sk_flags & SK_BT_DESC))
+ return array->sksup.decrement(rel, skey->sk_argument);
+ else
+ return array->sksup.increment(rel, skey->sk_argument);
+}
+
+/*
+ * _bt_apply_increment() -- Get an incremented copy of skey's arg
+ *
+ * Note: this wrapper function calls the opclass decrement function when the
+ * index stores values in descending order. We're "logically incrementing" to
+ * the next value in the key space regardless.
+ */
+static inline Datum
+_bt_apply_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ if (!(skey->sk_flags & SK_BT_DESC))
+ return array->sksup.increment(rel, skey->sk_argument);
+ else
+ return array->sksup.decrement(rel, skey->sk_argument);
+}
+
/*
* _bt_setup_array_cmp() -- Set up array comparison functions
*
@@ -979,15 +1398,10 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
{
Relation rel = scan->indexRelation;
Oid opcintype = rel->rd_opcintype[arraysk->sk_attno - 1];
- int cmpresult = 0,
- cmpexact = 0,
- matchelem,
- new_nelems = 0;
FmgrInfo crosstypeproc;
FmgrInfo *orderprocp = orderproc;
Assert(arraysk->sk_attno == skey->sk_attno);
- Assert(array->num_elems > 0);
Assert(!(arraysk->sk_flags & (SK_ISNULL | SK_ROW_HEADER | SK_ROW_MEMBER)));
Assert((arraysk->sk_flags & SK_SEARCHARRAY) &&
arraysk->sk_strategy == BTEqualStrategyNumber);
@@ -1000,8 +1414,8 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
* datum of opclass input type for the index's attribute (on-disk type).
* We can reuse the array's ORDER proc whenever the non-array scan key's
* type is a match for the corresponding attribute's input opclass type.
- * Otherwise, we have to do another ORDER proc lookup so that our call to
- * _bt_binsrch_array_skey applies the correct comparator.
+ * Otherwise, we have to do another ORDER proc lookup. We have to be sure
+ * that _bt_compare_array_skey/_bt_binsrch_array_skey use the right proc.
*
* Note: we have to support the convention that sk_subtype == InvalidOid
* means the opclass input type; this is a hack to simplify life for
@@ -1032,11 +1446,45 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
return false;
}
- /* We have all we need to determine redundancy/contradictoriness */
orderprocp = &crosstypeproc;
fmgr_info(cmp_proc, orderprocp);
}
+ /*
+ * We have all we need to determine redundancy/contradictoriness.
+ *
+ * Perform preprocessing of the array based on whether it's a conventional
+ * array, or a skip array. Sets *qual_ok correctly in passing.
+ */
+ if (array->num_elems != -1)
+ _bt_apply_compare_array(arraysk, skey,
+ orderprocp, array, qual_ok);
+ else
+ _bt_apply_compare_skiparray(scan, arraysk, skey, orderproc,
+ orderprocp, array, qual_ok);
+
+ return true;
+}
+
+/*
+ * Finish off preprocessing of conventional (non-skip) array scan key when it
+ * is redundant with (or contradicted by) a non-array scalar scan key.
+ *
+ * _bt_compare_array_scankey_args helper function, called after the relevant
+ * (potentially cross-type) ORDER proc has been looked up successfully.
+ */
+static void
+_bt_apply_compare_array(ScanKey arraysk, ScanKey skey, FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok)
+{
+ int cmpresult = 0,
+ cmpexact = 0,
+ matchelem,
+ new_nelems = 0;
+
+ Assert(array->num_elems > 0);
+ Assert(!(arraysk->sk_flags & SK_BT_SKIP));
+
matchelem = _bt_binsrch_array_skey(orderprocp, false,
NoMovementScanDirection,
skey->sk_argument, false, array,
@@ -1088,8 +1536,175 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
array->num_elems = new_nelems;
*qual_ok = new_nelems > 0;
+}
- return true;
+/*
+ * Finish off preprocessing of skip array scan key when it is redundant with
+ * (or contradicted by) a non-array scalar scan key.
+ *
+ * _bt_compare_array_scankey_args helper function, called after the relevant
+ * (potentially cross-type) ORDER proc has been looked up successfully.
+ *
+ * Arrays used to skip (skip scan/missing key attribute predicates) work by
+ * procedurally generating their elements on the fly. We must still
+ * "eliminate contradictory elements", but it works a little differently: we
+ * narrow the range of the skip array, such that the array will never
+ * generate contradicted-by-skey elements.
+ *
+ * FIXME Our behavior in scenarios with cross-type operators (range skip scan
+ * cases) is buggy. We're naively copying datums of a different type from
+ * scalar inequality scan keys into the array's low_value and high_value
+ * fields. This tends not to break visibly, since types that appear within
+ * the same operator family tend to have compatible datum representations
+ * (at least on systems with little-endian byte order). Put
+ * off dealing with the problem until a later revision of the patch.
+ *
+ * It seems likely that the best way to fix this problem will involve keeping
+ * around the original operator in the BTArrayKeyInfo array struct whenever
+ * we're passed a "redundant" cross-type inequality operator (an approach
+ * involving casts/coercions might be tempting, but seems much too fragile).
+ * We only need to use not-column-input-opclass-type operators for the first
+ * and/or last array elements from the skip array under this scheme; we'll
+ * still mostly be dealing with opcintype-typed datums, copied from the index
+ * (as well as incrementing/decrementing copies of those index tuple datums).
+ * Importantly, this scheme should work just as well with an opfamily that
+ * doesn't even have an orderprocp cross-type ORDER operator to pass us here
+ * (we might even have to keep more than one same-strategy inequality, since
+ * in general _bt_preprocess_keys might not be able to prove which inequality
+ * is redundant).
+ */
+static void
+_bt_apply_compare_skiparray(IndexScanDesc scan, ScanKey arraysk, ScanKey skey,
+ FmgrInfo *orderproc, FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok)
+{
+ Relation rel = scan->indexRelation;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Form_pg_attribute attr = TupleDescAttr(RelationGetDescr(rel),
+ skey->sk_attno - 1);
+ MemoryContext oldContext;
+ int cmpresult;
+
+ /*
+ * We don't expect to have to deal with NULLs in the non-array/non-skip
+ * scan key. We expect _bt_preprocess_array_keys to avoid generating a skip
+ * array for an index attribute with an IS NULL input scan key. It will
+ * still do so in the presence of IS NOT NULL input scan keys, but
+ * _bt_compare_scankey_args is expected to handle those for us.
+ */
+ Assert(arraysk->sk_flags & SK_BT_SKIP);
+ Assert(arraysk->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & SK_ISNULL));
+ Assert(array->num_elems == -1);
+
+ /*
+ * Scalar scan key must be a B-Tree operator, which must always be strict.
+ * Array shouldn't generate a NULL "array element"/an IS NULL qual. This
+ * isn't just an optimization; it's strictly necessary for correctness.
+ */
+ array->null_elem = false;
+
+ if (!array->use_sksup)
+ {
+ switch (skey->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+ case BTLessEqualStrategyNumber:
+ array->high_compare = MemoryContextAlloc(so->arrayContext,
+ sizeof(ScanKeyData));
+ memcpy(array->high_compare, skey, sizeof(ScanKeyData));
+ break;
+ case BTGreaterEqualStrategyNumber:
+ case BTGreaterStrategyNumber:
+ array->low_compare = MemoryContextAlloc(so->arrayContext,
+ sizeof(ScanKeyData));
+ memcpy(array->low_compare, skey, sizeof(ScanKeyData));
+ break;
+ default:
+ elog(ERROR, "unrecognized StrategyNumber: %d",
+ (int) skey->sk_strategy);
+ break;
+ }
+
+ array->null_elem = false;
+ *qual_ok = true;
+
+ return;
+ }
+
+ switch (skey->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+
+ /*
+ * detect if scan key argument will be < low_value once
+ * decremented
+ */
+ cmpresult = _bt_compare_array_skey(orderprocp,
+ skey->sk_argument, false,
+ array->sksup.low_elem, false,
+ arraysk);
+ if (cmpresult <= 0)
+ {
+ /* decrementing would make qual unsatisfiable, so don't try */
+ *qual_ok = false;
+ return;
+ }
+
+ /* decremented scan key value becomes skip array's new high_value */
+ oldContext = MemoryContextSwitchTo(so->arrayContext);
+ array->sksup.high_elem = _bt_apply_decrement(rel, skey, array);
+ MemoryContextSwitchTo(oldContext);
+ break;
+ case BTLessEqualStrategyNumber:
+ oldContext = MemoryContextSwitchTo(so->arrayContext);
+ array->sksup.high_elem = datumCopy(skey->sk_argument,
+ attr->attbyval, attr->attlen);
+ MemoryContextSwitchTo(oldContext);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ oldContext = MemoryContextSwitchTo(so->arrayContext);
+ array->sksup.low_elem = datumCopy(skey->sk_argument,
+ attr->attbyval, attr->attlen);
+ MemoryContextSwitchTo(oldContext);
+ break;
+ case BTGreaterStrategyNumber:
+
+ /*
+ * detect if scan key argument will be > high_value once
+ * incremented
+ */
+ cmpresult = _bt_compare_array_skey(orderprocp,
+ skey->sk_argument, false,
+ array->sksup.high_elem, false,
+ arraysk);
+ if (cmpresult >= 0)
+ {
+ /* incrementing would make qual unsatisfiable, so don't try */
+ *qual_ok = false;
+ return;
+ }
+
+ /* incremented scan key value becomes skip array's new low_value */
+ oldContext = MemoryContextSwitchTo(so->arrayContext);
+ array->sksup.low_elem = _bt_apply_increment(rel, skey, array);
+ MemoryContextSwitchTo(oldContext);
+ break;
+ default:
+ elog(ERROR, "unrecognized StrategyNumber: %d",
+ (int) skey->sk_strategy);
+ break;
+ }
+
+ /*
+ * Is the qual contradictory, or is it merely "redundant" with consed-up
+ * skip array?
+ */
+ cmpresult = _bt_compare_array_skey(orderproc, /* don't use orderprocp */
+ array->sksup.low_elem, false,
+ array->sksup.high_elem, false,
+ arraysk);
+ *qual_ok = (cmpresult <= 0);
}
/*
@@ -1130,7 +1745,8 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
static inline int32
_bt_compare_array_skey(FmgrInfo *orderproc,
Datum tupdatum, bool tupnull,
- Datum arrdatum, ScanKey cur)
+ Datum arrdatum, bool arrnull,
+ ScanKey cur)
{
int32 result = 0;
@@ -1138,14 +1754,14 @@ _bt_compare_array_skey(FmgrInfo *orderproc,
if (tupnull) /* NULL tupdatum */
{
- if (cur->sk_flags & SK_ISNULL)
+ if (arrnull)
result = 0; /* NULL "=" NULL */
else if (cur->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (cur->sk_flags & SK_ISNULL) /* NOT_NULL tupdatum, NULL arrdatum */
+ else if (arrnull) /* NOT_NULL tupdatum, NULL arrdatum */
{
if (cur->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
@@ -1211,6 +1827,8 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
Datum arrdatum;
Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(!(cur->sk_flags & SK_BT_SKIP));
+ Assert(!(cur->sk_flags & SK_ISNULL)); /* plain arrays can't do this */
Assert(cur->sk_strategy == BTEqualStrategyNumber);
if (cur_elem_trig)
@@ -1246,7 +1864,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
{
arrdatum = array->elem_values[low_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result <= 0)
{
@@ -1274,7 +1892,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
{
arrdatum = array->elem_values[high_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result >= 0)
{
@@ -1301,7 +1919,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
arrdatum = array->elem_values[mid_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result == 0)
{
@@ -1326,13 +1944,196 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
*/
if (low_elem != mid_elem)
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- array->elem_values[low_elem], cur);
+ array->elem_values[low_elem], false,
+ cur);
*set_elem_result = result;
return low_elem;
}
+/*
+ * _bt_binsrch_skiparray_skey() -- "Binary search" within a skip array
+ *
+ * Skip scan arrays procedurally generate their elements on demand. They
+ * largely function in the same way as standard arrays. They can be rolled
+ * over by standard arrays (standard arrays can also be rolled over by them).
+ *
+ * This routine doesn't return an index into the array, because the array
+ * doesn't actually have any elements (it has low_value and high_value, which
+ * indicate the range of values that the array can generate). Note that this
+ * may include a NULL value/an IS NULL qual (unlike with true arrays).
+ *
+ * Sets *set_elem_result just like _bt_binsrch_array_skey would with a true
+ * array. The value 0 indicates that tupdatum/tupnull is within the range of
+ * the skip array. Other values indicate what _bt_compare_array_skey returned
+ * for the best available match to tupdatum/tupnull (in practice this means
+ * either the lowest item or the highest item in the range of the array).
+ *
+ * cur_elem_trig indicates if array advancement was triggered by this skip
+ * array's scan key. We can apply this information to find the next matching
+ * array element in the current scan direction using fewer comparisons.
+ */
+static void
+_bt_binsrch_skiparray_skey(FmgrInfo *orderproc,
+ bool cur_elem_trig, ScanDirection dir,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *set_elem_result)
+{
+ Datum arrdatum;
+ bool arrnull;
+
+ Assert(!ScanDirectionIsNoMovement(dir));
+ Assert(cur->sk_flags & SK_BT_SKIP);
+ Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(cur->sk_flags & SK_BT_REQFWD);
+ Assert(array->num_elems == -1);
+
+ /* Precheck for NULL tupdatum, array without a NULL element */
+ if (tupnull && !array->null_elem)
+ {
+ if (!(cur->sk_flags & SK_BT_NULLS_FIRST))
+ *set_elem_result = 1;
+ else
+ *set_elem_result = -1;
+
+ return;
+ }
+
+ /*
+ * Compare tupdatum against "first array element" in the current scan
+ * direction first (and allow NULL to be treated as a possible element).
+ *
+ * Optimization: don't have to bother with this when passed a skip array
+ * that is known to have triggered array advancement.
+ */
+ if (!cur_elem_trig)
+ {
+ if (array->use_sksup)
+ {
+ if (ScanDirectionIsForward(dir))
+ {
+ arrdatum = array->sksup.low_elem;
+ arrnull = array->null_elem &&
+ (cur->sk_flags & SK_BT_NULLS_FIRST);
+ }
+ else
+ {
+ arrdatum = array->sksup.high_elem;
+ arrnull = array->null_elem &&
+ !(cur->sk_flags & SK_BT_NULLS_FIRST);
+ }
+
+ *set_elem_result = _bt_compare_array_skey(orderproc,
+ tupdatum, tupnull,
+ arrdatum, arrnull, cur);
+
+ /*
+ * Optimization: return early when >= lower bound happens to be an
+ * exact match (or when <= upper bound is an exact match during a
+ * backwards scan)
+ */
+ if (*set_elem_result == 0)
+ return;
+ }
+ else
+ {
+ *set_elem_result = 0; /* for now */
+
+ if (ScanDirectionIsForward(dir) && array->low_compare)
+ {
+ ScanKey low_compare = array->low_compare;
+
+ if (!DatumGetBool(FunctionCall2Coll(&low_compare->sk_func,
+ low_compare->sk_collation,
+ tupdatum,
+ low_compare->sk_argument)))
+ *set_elem_result = -1;
+ }
+ else if (ScanDirectionIsBackward(dir) && array->high_compare)
+ {
+ ScanKey high_compare = array->high_compare;
+
+ if (!DatumGetBool(FunctionCall2Coll(&high_compare->sk_func,
+ high_compare->sk_collation,
+ tupdatum,
+ high_compare->sk_argument)))
+ *set_elem_result = 1;
+ }
+ }
+
+ /* tupdatum before the start of first element in scan direction? */
+ if ((ScanDirectionIsForward(dir) && *set_elem_result < 0) ||
+ (ScanDirectionIsBackward(dir) && *set_elem_result > 0))
+ return;
+ }
+
+ /*
+ * Now compare tupdatum to the last array element in the current scan
+ * direction (and allow NULL to be treated as a possible element)
+ */
+ if (array->use_sksup)
+ {
+ /*
+ * We have skip support, so there is literally a final element
+ */
+ if (ScanDirectionIsForward(dir))
+ {
+ arrdatum = array->sksup.high_elem;
+ arrnull = array->null_elem && !(cur->sk_flags & SK_BT_NULLS_FIRST);
+ }
+ else
+ {
+ arrdatum = array->sksup.low_elem;
+ arrnull = array->null_elem && (cur->sk_flags & SK_BT_NULLS_FIRST);
+ }
+ *set_elem_result = _bt_compare_array_skey(orderproc,
+ tupdatum, tupnull,
+ arrdatum, arrnull, cur);
+ }
+ else
+ {
+ *set_elem_result = 0; /* for now */
+
+ /*
+ * No skip support. Use whichever inequality is required in the
+ * current scan direction to demarcate the final element.
+ */
+ if (ScanDirectionIsForward(dir) && array->high_compare)
+ {
+ ScanKey high_compare = array->high_compare;
+
+ if (!DatumGetBool(FunctionCall2Coll(&high_compare->sk_func,
+ high_compare->sk_collation,
+ tupdatum,
+ high_compare->sk_argument)))
+ *set_elem_result = 1;
+ }
+ else if (ScanDirectionIsBackward(dir) && array->low_compare)
+ {
+ ScanKey low_compare = array->low_compare;
+
+ if (!DatumGetBool(FunctionCall2Coll(&low_compare->sk_func,
+ low_compare->sk_collation,
+ tupdatum,
+ low_compare->sk_argument)))
+ *set_elem_result = -1;
+ }
+ }
+
+ /* tupdatum after the end of final element in scan direction? */
+ if ((ScanDirectionIsForward(dir) && *set_elem_result > 0) ||
+ (ScanDirectionIsBackward(dir) && *set_elem_result < 0))
+ return;
+
+ /*
+ * tupdatum is within the range of the skip array. This is equivalent to
+ * _bt_binsrch_array_skey finding an exactly matching array element.
+ */
+ *set_elem_result = 0;
+}
+
/*
* _bt_start_array_keys() -- Initialize array keys at start of a scan
*
@@ -1342,29 +2143,488 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
void
_bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- int i;
Assert(so->numArrayKeys);
Assert(so->qual_ok);
- for (i = 0; i < so->numArrayKeys; i++)
+ for (int i = 0; i < so->numArrayKeys; i++)
{
BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
ScanKey skey = &so->keyData[curArrayKey->scan_key];
- Assert(curArrayKey->num_elems > 0);
Assert(skey->sk_flags & SK_SEARCHARRAY);
- if (ScanDirectionIsBackward(dir))
- curArrayKey->cur_elem = curArrayKey->num_elems - 1;
- else
- curArrayKey->cur_elem = 0;
- skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
+ _bt_scankey_set_low_or_high(rel, skey, curArrayKey,
+ ScanDirectionIsForward(dir));
}
so->scanBehind = false;
}
+/*
+ * _bt_scankey_decrement() -- decrement scan key's sk_argument
+ *
+ * Unsets scan key "IS NULL" flags if required. Cannot handle "decrementing"
+ * sk_argument from a non-NULL value to the value NULL.
+ */
+static void
+_bt_scankey_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+
+ if (skey->sk_flags & SK_ISNULL)
+ _bt_scankey_unset_isnull(rel, skey, array);
+ else
+ {
+ Datum dec_sk_argument;
+ Form_pg_attribute attr;
+
+ /* Get a decremented copy of existing sk_argument */
+ dec_sk_argument = _bt_apply_decrement(rel, skey, array);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Set decremented copy of original sk_argument in scan key */
+ skey->sk_argument = dec_sk_argument;
+ }
+}
+
+/*
+ * _bt_scankey_increment() -- increment scan key's sk_argument
+ *
+ * Unsets scan key "IS NULL" flags if required. Cannot handle "incrementing"
+ * sk_argument from a non-NULL value to the value NULL.
+ */
+static void
+_bt_scankey_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(array->use_sksup);
+
+ if (skey->sk_flags & SK_ISNULL)
+ _bt_scankey_unset_isnull(rel, skey, array);
+ else
+ {
+ Datum inc_sk_argument;
+ Form_pg_attribute attr;
+
+ /* Get an incremented copy of existing sk_argument */
+ inc_sk_argument = _bt_apply_increment(rel, skey, array);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Set incremented copy of original sk_argument in scan key */
+ skey->sk_argument = inc_sk_argument;
+ }
+}
+
+/*
+ * _bt_scankey_set_low_or_high() -- Set array scan key to lowest/highest element
+ *
+ * Caller also passes associated scan key, which will have its argument set to
+ * the lowest/highest array value in passing.
+ */
+static void
+_bt_scankey_set_low_or_high(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ bool low_not_high)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+
+ if (array->num_elems != -1)
+ {
+ /* set low or high element for conventional array */
+ int set_elem = 0;
+
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+
+ if (!low_not_high)
+ set_elem = array->num_elems - 1;
+
+ /*
+ * Just copy over array datum (only skip arrays require freeing and
+ * allocating memory for sk_argument)
+ */
+ array->cur_elem = set_elem;
+ skey->sk_argument = array->elem_values[set_elem];
+
+ return;
+ }
+
+ /* set low or high element for skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(array->num_elems == -1);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Clear possibly-irrelevant flags (before possibly setting some again) */
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEG_INF | SK_BT_POS_INF |
+ SK_BT_NEXTKEY | SK_BT_PREVKEY);
+
+ if (array->null_elem &&
+ (low_not_high == ((skey->sk_flags & SK_BT_NULLS_FIRST) != 0)))
+ {
+ /* Set element to NULL (lowest/highest element) */
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+ }
+ else if (low_not_high)
+ {
+ /* Lowest array element isn't NULL */
+ ScanKey low_compare = array->low_compare;
+
+ if (array->use_sksup)
+ skey->sk_argument = datumCopy(array->sksup.low_elem,
+ attr->attbyval, attr->attlen);
+ else if (!low_compare)
+ skey->sk_flags |= SK_BT_NEG_INF;
+ else if (low_compare->sk_subtype != InvalidOid &&
+ low_compare->sk_subtype !=
+ rel->rd_opcintype[skey->sk_attno - 1])
+ {
+ /* XXX papers over lack of cross-type support in _bt_first */
+ skey->sk_flags |= SK_BT_NEG_INF;
+ }
+ else
+ {
+ skey->sk_argument = datumCopy(low_compare->sk_argument,
+ attr->attbyval, attr->attlen);
+
+ if (low_compare->sk_strategy == BTGreaterStrategyNumber)
+ skey->sk_flags |= SK_BT_NEXTKEY;
+ }
+ }
+ else
+ {
+ /* Highest array element isn't NULL */
+ ScanKey high_compare = array->high_compare;
+
+ if (array->use_sksup)
+ skey->sk_argument = datumCopy(array->sksup.high_elem,
+ attr->attbyval, attr->attlen);
+ else if (!high_compare)
+ skey->sk_flags |= SK_BT_POS_INF;
+ else if (high_compare->sk_subtype != InvalidOid &&
+ high_compare->sk_subtype !=
+ rel->rd_opcintype[skey->sk_attno - 1])
+ {
+ /* XXX papers over lack of cross-type support in _bt_first */
+ skey->sk_flags |= SK_BT_POS_INF;
+ }
+ else
+ {
+ skey->sk_argument = datumCopy(high_compare->sk_argument,
+ attr->attbyval, attr->attlen);
+ if (high_compare->sk_strategy == BTLessStrategyNumber)
+ skey->sk_flags |= SK_BT_PREVKEY;
+ }
+ }
+}
+
+/*
+ * _bt_scankey_skip_increment() -- increment a skip scan key, and its array
+ *
+ * Returns true when the skip array was successfully incremented to the next
+ * value in the current scan direction, dir. Otherwise handles rollover by
+ * resetting the array to its first element for dir, and returns false.
+ */
+static bool
+_bt_scankey_skip_increment(Relation rel, ScanDirection dir,
+ BTArrayKeyInfo *array, ScanKey skey,
+ FmgrInfo *orderproc)
+{
+ Datum sk_argument = skey->sk_argument;
+ bool sk_isnull = (skey->sk_flags & SK_ISNULL) != 0;
+ int compare;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(array->num_elems == -1);
+
+ /*
+ * Precheck for the sentinel values -inf and +inf. These values are only
+ * used for index columns whose input operator class doesn't provide its
+ * own skip support routine.
+ */
+ Assert(!(skey->sk_flags & SK_BT_POS_INF) || ScanDirectionIsForward(dir));
+ Assert(!(skey->sk_flags & SK_BT_NEG_INF) || ScanDirectionIsBackward(dir));
+ if (skey->sk_flags & (SK_BT_POS_INF | SK_BT_NEG_INF))
+ {
+ Assert(!array->use_sksup);
+ goto rollover;
+ }
+
+ skey->sk_flags &= ~(SK_BT_NEXTKEY | SK_BT_PREVKEY);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ if (array->high_compare)
+ {
+ ScanKey high_compare = array->high_compare;
+
+ Assert(!array->use_sksup);
+ Assert(!array->null_elem && !sk_isnull);
+
+ if (high_compare->sk_strategy == BTLessEqualStrategyNumber)
+ {
+ /* XXX Need to consider cross-type operator families here */
+ compare = _bt_compare_array_skey(orderproc,
+ high_compare->sk_argument, false,
+ sk_argument, sk_isnull, skey);
+ if (compare <= 0)
+ goto rollover;
+ }
+ else if (!DatumGetBool(FunctionCall2Coll(&high_compare->sk_func,
+ high_compare->sk_collation,
+ sk_argument,
+ high_compare->sk_argument)))
+ goto rollover;
+ }
+
+ if (!array->use_sksup)
+ {
+ /*
+ * Optimization: when the current array element is NULL, and the
+ * last item stored in the index is also NULL, treat NULL as the
+ * final array element (final when scanning forwards).
+ *
+ * This saves a useless primitive index scan that would otherwise
+ * try to locate a value after NULL.
+ */
+ if (sk_isnull && !(skey->sk_flags & SK_BT_NULLS_FIRST))
+ goto rollover;
+
+ /* "Increment" sk_argument to sentinel value */
+ skey->sk_flags |= SK_BT_NEXTKEY;
+ return true;
+ }
+
+ /* high_elem is final non-NULL element in current scan direction */
+ compare = _bt_compare_array_skey(orderproc,
+ array->sksup.high_elem, false,
+ sk_argument, sk_isnull, skey);
+ if (compare > 0)
+ {
+ /* Increment sk_argument to next non-NULL array element */
+ _bt_scankey_increment(rel, skey, array);
+
+ return true;
+ }
+ else if (compare == 0 && array->null_elem &&
+ !(skey->sk_flags & SK_BT_NULLS_FIRST))
+ {
+ /*
+ * Existing sk_argument is already equal to non-NULL high_elem,
+ * but skip array's true highest element is actually NULL.
+ *
+ * "Increment" sk_argument to NULL.
+ */
+ _bt_scankey_set_isnull(rel, skey, array);
+
+ return true;
+ }
+
+ /* Exhausted all array elements in current scan direction */
+ }
+ else
+ {
+ if (array->low_compare)
+ {
+ ScanKey low_compare = array->low_compare;
+
+ Assert(!array->use_sksup);
+ Assert(!array->null_elem && !sk_isnull);
+
+ if (low_compare->sk_strategy == BTGreaterEqualStrategyNumber)
+ {
+ /* XXX Need to consider cross-type operator families here */
+ compare = _bt_compare_array_skey(orderproc,
+ low_compare->sk_argument, false,
+ sk_argument, sk_isnull, skey);
+ if (compare >= 0)
+ goto rollover;
+ }
+ else if (!DatumGetBool(FunctionCall2Coll(&low_compare->sk_func,
+ low_compare->sk_collation,
+ sk_argument,
+ low_compare->sk_argument)))
+ goto rollover;
+ }
+
+ if (!array->use_sksup)
+ {
+ /*
+ * Optimization: when the current array element is NULL, and the
+ * first item stored in the index is also NULL, treat NULL as the
+ * final array element (final when scanning backwards).
+ *
+ * This saves a useless primitive index scan that would otherwise
+ * try to locate a value before NULL.
+ */
+ if (sk_isnull && (skey->sk_flags & SK_BT_NULLS_FIRST))
+ goto rollover;
+
+ /* "Decrement" sk_argument to sentinel value */
+ skey->sk_flags |= SK_BT_PREVKEY;
+ return true;
+ }
+
+ /* low_elem is final non-NULL element in current scan direction */
+ compare = _bt_compare_array_skey(orderproc,
+ array->sksup.low_elem, false,
+ sk_argument, sk_isnull, skey);
+ if (compare < 0)
+ {
+ /* Decrement sk_argument to previous non-NULL array element */
+ _bt_scankey_decrement(rel, skey, array);
+
+ return true;
+ }
+ else if (compare == 0 && array->null_elem &&
+ (skey->sk_flags & SK_BT_NULLS_FIRST))
+ {
+ /*
+ * Existing sk_argument is already equal to non-NULL low_elem, but
+ * skip array's true lowest element is actually NULL.
+ *
+ * "Decrement" sk_argument to NULL.
+ */
+ _bt_scankey_set_isnull(rel, skey, array);
+
+ return true;
+ }
+
+ /* Exhausted all array elements in current scan direction */
+ }
+
+ /*
+ * Skip array rolls over. Start over at the array's lowest sorting value
+ * (or its highest value, for backward scans).
+ */
+rollover:
+
+ _bt_scankey_set_low_or_high(rel, skey, array, ScanDirectionIsForward(dir));
+
+ /* Caller must consider earlier/more significant arrays in turn */
+ return false;
+}
+
+/*
+ * _bt_scankey_set_element() -- Set skip array scan key's sk_argument
+ *
+ * Sets scan key to "IS NULL" when required, and handles memory management for
+ * pass-by-reference types.
+ */
+static void
+_bt_scankey_set_element(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ Datum tupdatum, bool tupnull)
+{
+ /* tupdatum within the range of low_value/high_value */
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(tupnull && !array->null_elem));
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEG_INF | SK_BT_POS_INF |
+ SK_BT_NEXTKEY | SK_BT_PREVKEY);
+
+ /*
+ * Treat tupdatum/tupnull as a matching array element.
+ *
+ * We just copy tupdatum into the array's scan key (there is no
+ * conventional array element for us to set, of course).
+ *
+ * Unlike standard arrays, skip arrays sometimes need to locate NULLs.
+ * Treat them as just another value from the domain of indexed values.
+ */
+ if (!tupnull)
+ skey->sk_argument = datumCopy(tupdatum, attr->attbyval, attr->attlen);
+ else
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+}
+
+/*
+ * _bt_scankey_unset_isnull() -- increment/decrement scan key from NULL
+ *
+ * Unsets scan key's "IS NULL" marking, and sets the non-NULL value from the
+ * array immediately before (or immediately after) NULL in the key space.
+ */
+static void
+_bt_scankey_unset_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(skey->sk_flags & SK_ISNULL);
+ Assert(array->use_sksup);
+ Assert(array->null_elem);
+
+ /*
+ * sk_argument must be set to whatever non-NULL value comes immediately
+ * before or after NULL
+ */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEG_INF | SK_BT_POS_INF |
+ SK_BT_NEXTKEY | SK_BT_PREVKEY);
+ if (skey->sk_flags & SK_BT_NULLS_FIRST)
+ skey->sk_argument = datumCopy(array->sksup.low_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_argument = datumCopy(array->sksup.high_elem,
+ attr->attbyval, attr->attlen);
+}
+
+/*
+ * _bt_scankey_set_isnull() -- increment/decrement scan key to NULL
+ *
+ * Sets scan key to "IS NULL", and handles memory management for
+ * pass-by-reference types.
+ */
+static void
+_bt_scankey_set_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & (SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEG_INF | SK_BT_POS_INF |
+ SK_BT_NEXTKEY | SK_BT_PREVKEY)));
+ Assert(array->null_elem);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Set sk_argument to NULL */
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+}
+
/*
* _bt_advance_array_keys_increment() -- Advance to next set of array elements
*
@@ -1380,6 +2640,7 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
static bool
_bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
/*
@@ -1391,10 +2652,24 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
{
BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
ScanKey skey = &so->keyData[curArrayKey->scan_key];
+ FmgrInfo *orderproc = &so->orderProcs[curArrayKey->scan_key];
int cur_elem = curArrayKey->cur_elem;
int num_elems = curArrayKey->num_elems;
bool rolled = false;
+ /* Handle incrementing a skip array */
+ if (num_elems == -1)
+ {
+ /* Attempt to incrementally advance this skip scan array */
+ if (_bt_scankey_skip_increment(rel, dir, curArrayKey, skey,
+ orderproc))
+ return true;
+
+ /* Array rolled over. Need to advance next array key, if any. */
+ continue;
+ }
+
+ /* Handle incrementing a true array */
if (ScanDirectionIsForward(dir) && ++cur_elem >= num_elems)
{
cur_elem = 0;
@@ -1411,7 +2686,7 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
if (!rolled)
return true;
- /* Need to advance next array key, if any */
+ /* Array rolled over. Need to advance next array key, if any. */
}
/*
@@ -1466,6 +2741,7 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
static void
_bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
int arrayidx = 0;
@@ -1473,7 +2749,6 @@ _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
{
ScanKey cur = so->keyData + ikey;
BTArrayKeyInfo *array = NULL;
- int first_elem_dir;
if (!(cur->sk_flags & SK_SEARCHARRAY) ||
cur->sk_strategy != BTEqualStrategyNumber)
@@ -1485,16 +2760,10 @@ _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
if ((cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
continue;
- if (ScanDirectionIsForward(dir))
- first_elem_dir = 0;
- else
- first_elem_dir = array->num_elems - 1;
+ Assert(array->num_elems != -1); /* No skipping of non-required arrays */
- if (array->cur_elem != first_elem_dir)
- {
- array->cur_elem = first_elem_dir;
- cur->sk_argument = array->elem_values[first_elem_dir];
- }
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
}
}
@@ -1558,6 +2827,8 @@ _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
for (int ikey = sktrig; ikey < so->numberOfKeys; ikey++)
{
ScanKey cur = so->keyData + ikey;
+ Datum sk_argument = cur->sk_argument;
+ bool sk_isnull = (cur->sk_flags & SK_ISNULL) != 0;
Datum tupdatum;
bool tupnull;
int32 result;
@@ -1617,11 +2888,14 @@ _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
continue;
}
+ if (cur->sk_flags & (SK_BT_NEG_INF | SK_BT_POS_INF))
+ return false;
+
tupdatum = index_getattr(tuple, cur->sk_attno, tupdesc, &tupnull);
result = _bt_compare_array_skey(&so->orderProcs[ikey],
tupdatum, tupnull,
- cur->sk_argument, cur);
+ sk_argument, sk_isnull, cur);
/*
* Does this comparison indicate that caller must _not_ advance the
@@ -1631,6 +2905,9 @@ _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
(ScanDirectionIsBackward(dir) && result > 0))
return true;
+ if ((cur->sk_flags & (SK_BT_NEXTKEY | SK_BT_PREVKEY)) && result == 0)
+ return true;
+
/*
* Does this comparison indicate that caller should now advance the
* scan's arrays? (Must be if we get here during a readpagetup call.)
@@ -1954,18 +3231,9 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
if (beyond_end_advance)
{
- int final_elem_dir;
-
- if (ScanDirectionIsBackward(dir) || !array)
- final_elem_dir = 0;
- else
- final_elem_dir = array->num_elems - 1;
-
- if (array && array->cur_elem != final_elem_dir)
- {
- array->cur_elem = final_elem_dir;
- cur->sk_argument = array->elem_values[final_elem_dir];
- }
+ if (array)
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsBackward(dir));
continue;
}
@@ -1990,18 +3258,9 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
if (!all_required_satisfied || cur->sk_attno > tupnatts)
{
- int first_elem_dir;
-
- if (ScanDirectionIsForward(dir) || !array)
- first_elem_dir = 0;
- else
- first_elem_dir = array->num_elems - 1;
-
- if (array && array->cur_elem != first_elem_dir)
- {
- array->cur_elem = first_elem_dir;
- cur->sk_argument = array->elem_values[first_elem_dir];
- }
+ if (array)
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
continue;
}
@@ -2019,15 +3278,27 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
/*
* Binary search for closest match that's available from the array
*/
- set_elem = _bt_binsrch_array_skey(&so->orderProcs[ikey],
- cur_elem_trig, dir,
- tupdatum, tupnull, array, cur,
- &result);
+ if (array->num_elems != -1)
+ set_elem = _bt_binsrch_array_skey(&so->orderProcs[ikey],
+ cur_elem_trig, dir,
+ tupdatum, tupnull, array, cur,
+ &result);
- Assert(set_elem >= 0 && set_elem < array->num_elems);
+ /*
+ * Skip array. "Binary search" by checking if tupdatum/tupnull
+ * are within the low_value/high_value range of the skip array.
+ */
+ else
+ _bt_binsrch_skiparray_skey(&so->orderProcs[ikey],
+ cur_elem_trig, dir,
+ tupdatum, tupnull, array, cur,
+ &result);
}
else
{
+ Datum sk_argument = cur->sk_argument;
+ bool sk_isnull = (cur->sk_flags & SK_ISNULL) != 0;
+
Assert(sktrig_required && required);
/*
@@ -2041,7 +3312,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
result = _bt_compare_array_skey(&so->orderProcs[ikey],
tupdatum, tupnull,
- cur->sk_argument, cur);
+ sk_argument, sk_isnull, cur);
}
/*
@@ -2100,11 +3371,62 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
}
}
- /* Advance array keys, even when set_elem isn't an exact match */
- if (array && array->cur_elem != set_elem)
+ /* Advance array keys, even when we don't have an exact match */
+
+ if (!array)
+ continue; /* no element to set in non-array */
+
+ /* Conventional arrays have a valid set_elem for us to advance to */
+ if (array->num_elems != -1)
{
- array->cur_elem = set_elem;
- cur->sk_argument = array->elem_values[set_elem];
+ if (array->cur_elem != set_elem)
+ {
+ array->cur_elem = set_elem;
+ cur->sk_argument = array->elem_values[set_elem];
+ }
+
+ continue;
+ }
+
+ /*
+ * Conceptually, skip arrays also have array elements. The actual
+ * elements/values are generated procedurally and on demand.
+ */
+ Assert(cur->sk_flags & SK_BT_SKIP);
+ Assert(array->num_elems == -1);
+ Assert(required);
+
+ if (result == 0)
+ {
+ /*
+ * Anything within the range of possible element values is treated
+ * as "a match for one of the array's elements". Store the next
+ * scan key argument value by taking a copy of the tupdatum value
+ * from caller's tuple (or set scan key IS NULL when tupnull, iff
+ * the array's range of possible elements covers NULL).
+ */
+ _bt_scankey_set_element(rel, cur, array, tupdatum, tupnull);
+ }
+ else if (beyond_end_advance)
+ {
+ /*
+ * We need to set the array element to the final "element" in the
+ * current scan direction for "beyond end of array element" array
+ * advancement. See above for an explanation.
+ */
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsBackward(dir));
+ }
+ else
+ {
+ /*
+ * The closest matching element is the first element in the current
+ * scan direction; even that still puts us ahead of caller's tuple in
+ * the key space. This process has to carry to any lower-order
+ * arrays. See above for an explanation.
+ */
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
}
}
@@ -2460,10 +3782,12 @@ end_toplevel_scan:
/*
* _bt_preprocess_keys() -- Preprocess scan keys
*
+ * The first call here (per btrescan) allocates so->keyData[].
* The given search-type keys (taken from scan->keyData[])
* are copied to so->keyData[] with possible transformation.
* scan->numberOfKeys is the number of input keys, so->numberOfKeys gets
- * the number of output keys (possibly less, never greater).
+ * the number of output keys. Calling here a second time (during the same
+ * btrescan) is a no-op.
*
* The output keys are marked with additional sk_flags bits beyond the
* system-standard bits supplied by the caller. The DESC and NULLS_FIRST
@@ -2483,6 +3807,8 @@ end_toplevel_scan:
* within each attribute may be done as a byproduct of the processing here.
* That process must leave array scan keys (within an attribute) in the same
* order as corresponding entries from the scan's BTArrayKeyInfo array info.
+ * We might also cons up skip array scan keys that weren't present in the
+ * original input keys; these are also output in standard attribute order.
*
* The output keys are marked with flags SK_BT_REQFWD and/or SK_BT_REQBKWD
* if they must be satisfied in order to continue the scan forward or backward
@@ -2550,9 +3876,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
int16 *indoption = scan->indexRelation->rd_indoption;
int new_numberOfKeys;
int numberOfEqualCols;
- ScanKey inkeys;
- ScanKey outkeys;
- ScanKey cur;
+ ScanKey inputsk;
BTScanKeyPreproc xform[BTMaxStrategyNumber];
bool test_result;
int i,
@@ -2584,7 +3908,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
return; /* done if qual-less scan */
/* If any keys are SK_SEARCHARRAY type, set up array-key info */
- arrayKeyData = _bt_preprocess_array_keys(scan);
+ arrayKeyData = _bt_preprocess_array_keys(scan, &numberOfKeys);
if (!so->qual_ok)
{
/* unmatchable array, so give up */
@@ -2598,32 +3922,36 @@ _bt_preprocess_keys(IndexScanDesc scan)
*/
if (arrayKeyData)
{
- inkeys = arrayKeyData;
+ inputsk = arrayKeyData;
/* Also maintain keyDataMap for remapping so->orderProc[] later */
keyDataMap = MemoryContextAlloc(so->arrayContext,
numberOfKeys * sizeof(int));
}
else
- inkeys = scan->keyData;
+ inputsk = scan->keyData;
+
+ /*
+ * Now that we have an estimate of the number of output scan keys
+ * (including any skip array scan keys), allocate space for them
+ */
+ so->keyData = palloc(sizeof(ScanKeyData) * numberOfKeys);
- outkeys = so->keyData;
- cur = &inkeys[0];
/* we check that input keys are correctly ordered */
- if (cur->sk_attno < 1)
+ if (inputsk->sk_attno < 1)
elog(ERROR, "btree index keys must be ordered by attribute");
/* We can short-circuit most of the work if there's just one key */
if (numberOfKeys == 1)
{
/* Apply indoption to scankey (might change sk_strategy!) */
- if (!_bt_fix_scankey_strategy(cur, indoption))
+ if (!_bt_fix_scankey_strategy(inputsk, indoption))
so->qual_ok = false;
- memcpy(outkeys, cur, sizeof(ScanKeyData));
+ memcpy(so->keyData, inputsk, sizeof(ScanKeyData));
so->numberOfKeys = 1;
/* We can mark the qual as required if it's for first index col */
- if (cur->sk_attno == 1)
- _bt_mark_scankey_required(outkeys);
+ if (inputsk->sk_attno == 1)
+ _bt_mark_scankey_required(so->keyData);
if (arrayKeyData)
{
/*
@@ -2631,8 +3959,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
* (we'll miss out on the single value array transformation, but
* that's not nearly as important when there's only one scan key)
*/
- Assert(cur->sk_flags & SK_SEARCHARRAY);
- Assert(cur->sk_strategy != BTEqualStrategyNumber ||
+ Assert(so->keyData[0].sk_flags & SK_SEARCHARRAY);
+ Assert(so->keyData[0].sk_strategy != BTEqualStrategyNumber ||
(so->arrayKeys[0].scan_key == 0 &&
OidIsValid(so->orderProcs[0].fn_oid)));
}
@@ -2660,12 +3988,12 @@ _bt_preprocess_keys(IndexScanDesc scan)
* handle after-last-key processing. Actual exit from the loop is at the
* "break" statement below.
*/
- for (i = 0;; cur++, i++)
+ for (i = 0;; inputsk++, i++)
{
if (i < numberOfKeys)
{
/* Apply indoption to scankey (might change sk_strategy!) */
- if (!_bt_fix_scankey_strategy(cur, indoption))
+ if (!_bt_fix_scankey_strategy(inputsk, indoption))
{
/* NULL can't be matched, so give up */
so->qual_ok = false;
@@ -2677,12 +4005,12 @@ _bt_preprocess_keys(IndexScanDesc scan)
* If we are at the end of the keys for a particular attr, finish up
* processing and emit the cleaned-up keys.
*/
- if (i == numberOfKeys || cur->sk_attno != attno)
+ if (i == numberOfKeys || inputsk->sk_attno != attno)
{
int priorNumberOfEqualCols = numberOfEqualCols;
/* check input keys are correctly ordered */
- if (i < numberOfKeys && cur->sk_attno < attno)
+ if (i < numberOfKeys && inputsk->sk_attno < attno)
elog(ERROR, "btree index keys must be ordered by attribute");
/*
@@ -2741,7 +4069,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
return;
}
/* else discard the redundant non-equality key */
- Assert(!array || array->num_elems > 0);
+ Assert(!array || array->num_elems > 0 ||
+ array->num_elems == -1);
xform[j].skey = NULL;
xform[j].ikey = -1;
}
@@ -2786,7 +4115,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
}
/*
- * Emit the cleaned-up keys into the outkeys[] array, and then
+ * Emit the cleaned-up keys into the so->keyData[] array, and then
* mark them if they are required. They are required (possibly
* only in one direction) if all attrs before this one had "=".
*/
@@ -2794,7 +4123,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
{
if (xform[j].skey)
{
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
memcpy(outkey, xform[j].skey, sizeof(ScanKeyData));
if (arrayKeyData)
@@ -2811,19 +4140,19 @@ _bt_preprocess_keys(IndexScanDesc scan)
break;
/* Re-initialize for new attno */
- attno = cur->sk_attno;
+ attno = inputsk->sk_attno;
memset(xform, 0, sizeof(xform));
}
/* check strategy this key's operator corresponds to */
- j = cur->sk_strategy - 1;
+ j = inputsk->sk_strategy - 1;
/* if row comparison, push it directly to the output array */
- if (cur->sk_flags & SK_ROW_HEADER)
+ if (inputsk->sk_flags & SK_ROW_HEADER)
{
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
- memcpy(outkey, cur, sizeof(ScanKeyData));
+ memcpy(outkey, inputsk, sizeof(ScanKeyData));
if (arrayKeyData)
keyDataMap[new_numberOfKeys - 1] = i;
if (numberOfEqualCols == attno - 1)
@@ -2837,19 +4166,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
continue;
}
- /*
- * Does this input scan key require further processing as an array?
- */
- if (cur->sk_strategy == InvalidStrategy)
- {
- /* _bt_preprocess_array_keys marked this array key redundant */
- Assert(arrayKeyData);
- Assert(cur->sk_flags & SK_SEARCHARRAY);
- continue;
- }
-
- if (cur->sk_strategy == BTEqualStrategyNumber &&
- (cur->sk_flags & SK_SEARCHARRAY))
+ if (inputsk->sk_strategy == BTEqualStrategyNumber &&
+ (inputsk->sk_flags & SK_SEARCHARRAY))
{
/* _bt_preprocess_array_keys kept this array key */
Assert(arrayKeyData);
@@ -2863,7 +4181,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (xform[j].skey == NULL)
{
/* nope, so this scan key wins by default (at least for now) */
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -2881,7 +4199,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
/*
* Have to set up array keys
*/
- if ((cur->sk_flags & SK_SEARCHARRAY))
+ if ((inputsk->sk_flags & SK_SEARCHARRAY))
{
array = &so->arrayKeys[arrayidx - 1];
orderproc = so->orderProcs + i;
@@ -2909,13 +4227,15 @@ _bt_preprocess_keys(IndexScanDesc scan)
*/
}
- if (_bt_compare_scankey_args(scan, cur, cur, xform[j].skey,
- array, orderproc, &test_result))
+ if (_bt_compare_scankey_args(scan, inputsk, inputsk,
+ xform[j].skey, array, orderproc,
+ &test_result))
{
/* Have all we need to determine redundancy */
if (test_result)
{
- Assert(!array || array->num_elems > 0);
+ Assert(!array || array->num_elems > 0 ||
+ array->num_elems == -1);
/*
* New key is more restrictive, and so replaces old key...
@@ -2923,7 +4243,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (j != (BTEqualStrategyNumber - 1) ||
!(xform[j].skey->sk_flags & SK_SEARCHARRAY))
{
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -2936,7 +4256,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
* scan key. _bt_compare_scankey_args expects us to
* always keep arrays (and discard non-arrays).
*/
- Assert(!(cur->sk_flags & SK_SEARCHARRAY));
+ Assert(!(inputsk->sk_flags & SK_SEARCHARRAY));
}
}
else if (j == (BTEqualStrategyNumber - 1))
@@ -2959,14 +4279,14 @@ _bt_preprocess_keys(IndexScanDesc scan)
* even with incomplete opfamilies. _bt_advance_array_keys
* depends on this.
*/
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
memcpy(outkey, xform[j].skey, sizeof(ScanKeyData));
if (arrayKeyData)
keyDataMap[new_numberOfKeys - 1] = xform[j].ikey;
if (numberOfEqualCols == attno - 1)
_bt_mark_scankey_required(outkey);
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -3057,10 +4377,11 @@ _bt_verify_keys_with_arraykeys(IndexScanDesc scan)
if (array->scan_key != ikey)
return false;
- if (array->num_elems <= 0)
+ if (array->num_elems == 0 || array->num_elems < -1)
return false;
- if (cur->sk_argument != array->elem_values[array->cur_elem])
+ if (array->num_elems != -1 &&
+ cur->sk_argument != array->elem_values[array->cur_elem])
return false;
if (last_sk_attno > cur->sk_attno)
return false;
@@ -3135,6 +4456,22 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool leftnull,
rightnull;
+ /* Handle skip array comparison with IS NOT NULL scan key */
+ if ((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP)
+ {
+ /* Shouldn't generate skip array in presence of IS NULL key */
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_SEARCHNULL));
+ Assert((leftarg->sk_flags | rightarg->sk_flags) & SK_SEARCHNOTNULL);
+
+ /* Don't allow skip array to generate IS NULL scan key/element */
+ Assert(array->num_elems == -1);
+ array->null_elem = false;
+
+ /* IS NOT NULL key (could be leftarg or rightarg) now redundant */
+ *result = true;
+ return true;
+ }
+
if (leftarg->sk_flags & SK_ISNULL)
{
Assert(leftarg->sk_flags & (SK_SEARCHNULL | SK_SEARCHNOTNULL));
@@ -3208,6 +4545,7 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
{
/* Can't make the comparison */
*result = false; /* suppress compiler warnings */
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP));
return false;
}
@@ -3380,13 +4718,6 @@ _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption)
return true;
}
- if (skey->sk_strategy == InvalidStrategy)
- {
- /* Already-eliminated array scan key; don't need to fix anything */
- Assert(skey->sk_flags & SK_SEARCHARRAY);
- return true;
- }
-
/* Adjust strategy for DESC, if we didn't already */
if ((addflags & SK_BT_DESC) && !(skey->sk_flags & SK_BT_DESC))
skey->sk_strategy = BTCommuteStrategyNumber(skey->sk_strategy);
@@ -3734,6 +5065,21 @@ _bt_check_compare(IndexScanDesc scan, ScanDirection dir,
continue;
}
+ /*
+ * A skip array scan key might be negative/positive infinity. Might
+ * also be next key/previous key sentinel, which we don't deal with.
+ */
+ if (key->sk_flags & (SK_BT_NEG_INF | SK_BT_POS_INF |
+ SK_BT_NEXTKEY | SK_BT_PREVKEY))
+ {
+ Assert(key->sk_flags & SK_SEARCHARRAY);
+ Assert(key->sk_flags & SK_BT_SKIP);
+ Assert(requiredSameDir);
+
+ *continuescan = false;
+ return false;
+ }
+
/* row-comparison keys need special processing */
if (key->sk_flags & SK_ROW_HEADER)
{
diff --git a/src/backend/access/nbtree/nbtvalidate.c b/src/backend/access/nbtree/nbtvalidate.c
index e9d4cd60d..96d0d9185 100644
--- a/src/backend/access/nbtree/nbtvalidate.c
+++ b/src/backend/access/nbtree/nbtvalidate.c
@@ -114,6 +114,10 @@ btvalidate(Oid opclassoid)
case BTOPTIONS_PROC:
ok = check_amoptsproc_signature(procform->amproc);
break;
+ case BTSKIPSUPPORT_PROC:
+ ok = check_amproc_signature(procform->amproc, VOIDOID, true,
+ 1, 1, INTERNALOID);
+ break;
default:
ereport(INFO,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
diff --git a/src/backend/commands/opclasscmds.c b/src/backend/commands/opclasscmds.c
index b8b5c147c..a86dbf71b 100644
--- a/src/backend/commands/opclasscmds.c
+++ b/src/backend/commands/opclasscmds.c
@@ -1330,6 +1330,31 @@ assignProcTypes(OpFamilyMember *member, Oid amoid, Oid typeoid,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
errmsg("btree equal image functions must not be cross-type")));
}
+ else if (member->number == BTSKIPSUPPORT_PROC)
+ {
+ if (procform->pronargs != 1 ||
+ procform->proargtypes.values[0] != INTERNALOID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must accept type \"internal\"")));
+ if (procform->prorettype != VOIDOID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must return void")));
+
+ /*
+ * pg_amproc functions are indexed by (lefttype, righttype), but a
+ * skip support function doesn't make sense in cross-type
+ * scenarios. The same opclass opcintype OID is always used for
+ * both lefttype and righttype when the function is looked up.
+ * Reject cross-type ALTER OPERATOR FAMILY ... ADD FUNCTION 6
+ * statements here.
+ */
+ if (member->lefttype != member->righttype)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must not be cross-type")));
+ }
}
else if (amoid == HASH_AM_OID)
{
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index edb09d4e3..e945686c8 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -96,6 +96,7 @@ OBJS = \
rowtypes.o \
ruleutils.o \
selfuncs.o \
+ skipsupport.o \
tid.o \
timestamp.o \
trigfuncs.o \
diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c
index 9c854e0e5..ea3d0f4b5 100644
--- a/src/backend/utils/adt/date.c
+++ b/src/backend/utils/adt/date.c
@@ -34,6 +34,7 @@
#include "utils/date.h"
#include "utils/datetime.h"
#include "utils/numeric.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
/*
@@ -455,6 +456,39 @@ date_sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+date_decrement(Relation rel, Datum existing)
+{
+ DateADT dexisting = DatumGetDateADT(existing);
+
+ Assert(dexisting > DATEVAL_NOBEGIN);
+
+ return DateADTGetDatum(dexisting - 1);
+}
+
+static Datum
+date_increment(Relation rel, Datum existing)
+{
+ DateADT dexisting = DatumGetDateADT(existing);
+
+ Assert(dexisting < DATEVAL_NOEND);
+
+ return DateADTGetDatum(dexisting + 1);
+}
+
+Datum
+date_skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = date_decrement;
+ sksup->increment = date_increment;
+ sksup->low_elem = DateADTGetDatum(DATEVAL_NOBEGIN);
+ sksup->high_elem = DateADTGetDatum(DATEVAL_NOEND);
+
+ PG_RETURN_VOID();
+}
+
Datum
date_finite(PG_FUNCTION_ARGS)
{
diff --git a/src/backend/utils/adt/meson.build b/src/backend/utils/adt/meson.build
index 8c6fc80c3..91682edd5 100644
--- a/src/backend/utils/adt/meson.build
+++ b/src/backend/utils/adt/meson.build
@@ -83,6 +83,7 @@ backend_sources += files(
'rowtypes.c',
'ruleutils.c',
'selfuncs.c',
+ 'skipsupport.c',
'tid.c',
'timestamp.c',
'trigfuncs.c',
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5f5d7959d..33b1722df 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6800,6 +6800,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
List *indexBoundQuals;
int indexcol;
bool eqQualHere;
+ bool found_skip;
bool found_saop;
bool found_is_null_op;
double num_sa_scans;
@@ -6825,6 +6826,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
indexBoundQuals = NIL;
indexcol = 0;
eqQualHere = false;
+ found_skip = false;
found_saop = false;
found_is_null_op = false;
num_sa_scans = 1;
@@ -6833,15 +6835,38 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
IndexClause *iclause = lfirst_node(IndexClause, lc);
ListCell *lc2;
+ /*
+ * XXX For now we just cost skip scans via generic rules: make a
+ * uniform assumption that there will be 10 primitive index scans per
+ * skipped attribute, relying on the "1/3 of all index pages" cap that
+ * this costing has used since Postgres 17. Also assume that skipping
+ * won't take place for an index that has fewer than 100 pages.
+ *
+ * The current approach to costing leaves much to be desired, but is
+ * at least better than nothing at all (keeping the code as it is on
+ * HEAD just makes testing and review inconvenient).
+ */
if (indexcol != iclause->indexcol)
{
/* Beginning of a new column's quals */
if (!eqQualHere)
- break; /* done if no '=' qual for indexcol */
+ {
+ found_skip = true; /* skip when no '=' qual for indexcol */
+ if (index->pages < 100)
+ break;
+ num_sa_scans += 10;
+ }
eqQualHere = false;
indexcol++;
if (indexcol != iclause->indexcol)
- break; /* no quals at all for indexcol */
+ {
+ /* no quals at all for indexcol */
+ found_skip = true;
+ if (index->pages < 100)
+ break;
+ num_sa_scans += 10 * (iclause->indexcol - indexcol);
+ continue;
+ }
}
/* Examine each indexqual associated with this index clause */
@@ -6914,6 +6939,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
if (index->unique &&
indexcol == index->nkeycolumns - 1 &&
eqQualHere &&
+ !found_skip &&
!found_saop &&
!found_is_null_op)
numIndexTuples = 1.0;
diff --git a/src/backend/utils/adt/skipsupport.c b/src/backend/utils/adt/skipsupport.c
new file mode 100644
index 000000000..9665e4985
--- /dev/null
+++ b/src/backend/utils/adt/skipsupport.c
@@ -0,0 +1,54 @@
+/*-------------------------------------------------------------------------
+ *
+ * skipsupport.c
+ * Support routines for B-Tree skip scans.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/adt/skipsupport.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <limits.h>
+
+#include "access/nbtree.h"
+#include "utils/lsyscache.h"
+#include "utils/skipsupport.h"
+
+/*
+ * Fill in SkipSupport given an operator class (opfamily + opcintype).
+ *
+ * On success, returns true, and initializes all SkipSupport fields for
+ * caller (low_elem and high_elem are swapped when 'reverse' is set).
+ * Otherwise returns false: operator class has no skip support function.
+ */
+bool
+PrepareSkipSupportFromOpclass(Oid opfamily, Oid opcintype, bool reverse,
+ SkipSupport sksup)
+{
+ Oid skipSupportFunction;
+
+ /* Look for a skip support function */
+ skipSupportFunction = get_opfamily_proc(opfamily, opcintype, opcintype,
+ BTSKIPSUPPORT_PROC);
+ if (!OidIsValid(skipSupportFunction))
+ return false;
+
+ OidFunctionCall1(skipSupportFunction, PointerGetDatum(sksup));
+
+ if (reverse)
+ {
+ Datum low_elem = sksup->low_elem;
+
+ sksup->low_elem = sksup->high_elem;
+ sksup->high_elem = low_elem;
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/adt/uuid.c b/src/backend/utils/adt/uuid.c
index 45eb1b2fe..a9222f896 100644
--- a/src/backend/utils/adt/uuid.c
+++ b/src/backend/utils/adt/uuid.c
@@ -13,12 +13,15 @@
#include "postgres.h"
+#include <limits.h>
+
#include "common/hashfn.h"
#include "lib/hyperloglog.h"
#include "libpq/pqformat.h"
#include "port/pg_bswap.h"
#include "utils/fmgrprotos.h"
#include "utils/guc.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
#include "utils/timestamp.h"
#include "utils/uuid.h"
@@ -390,6 +393,68 @@ uuid_abbrev_convert(Datum original, SortSupport ssup)
return res;
}
+static Datum
+uuid_decrement(Relation rel, Datum existing)
+{
+ pg_uuid_t *uuid;
+
+ uuid = (pg_uuid_t *) palloc(UUID_LEN);
+ memcpy(uuid, DatumGetUUIDP(existing), UUID_LEN);
+ for (int i = UUID_LEN - 1; i >= 0; i--)
+ {
+ if (uuid->data[i] > 0)
+ {
+ uuid->data[i]--;
+ return UUIDPGetDatum(uuid);
+ }
+ uuid->data[i] = UCHAR_MAX;
+ }
+
+ Assert(false);
+
+ return UUIDPGetDatum(uuid);
+}
+
+static Datum
+uuid_increment(Relation rel, Datum existing)
+{
+ pg_uuid_t *uuid;
+
+ uuid = (pg_uuid_t *) palloc(UUID_LEN);
+ memcpy(uuid, DatumGetUUIDP(existing), UUID_LEN);
+ for (int i = UUID_LEN - 1; i >= 0; i--)
+ {
+ if (uuid->data[i] < UCHAR_MAX)
+ {
+ uuid->data[i]++;
+ return UUIDPGetDatum(uuid);
+ }
+ uuid->data[i] = 0;
+ }
+
+ Assert(false);
+
+ return UUIDPGetDatum(uuid);
+}
+
+Datum
+uuid_skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+ pg_uuid_t *uuid_min = palloc(UUID_LEN);
+ pg_uuid_t *uuid_max = palloc(UUID_LEN);
+
+ memset(uuid_min->data, 0x00, UUID_LEN);
+ memset(uuid_max->data, 0xFF, UUID_LEN);
+
+ sksup->decrement = uuid_decrement;
+ sksup->increment = uuid_increment;
+ sksup->low_elem = UUIDPGetDatum(uuid_min);
+ sksup->high_elem = UUIDPGetDatum(uuid_max);
+
+ PG_RETURN_VOID();
+}
+
/* hash index support */
Datum
uuid_hash(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 630ed0f16..6fc3ca1a7 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -28,6 +28,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/nbtree.h"
#include "access/slru.h"
#include "access/toast_compression.h"
#include "access/twophase.h"
@@ -1702,6 +1703,17 @@ struct config_bool ConfigureNamesBool[] =
},
#endif
+ /* XXX Remove before commit */
+ {
+ {"skipscan_skipsupport_enabled", PGC_SUSET, DEVELOPER_OPTIONS,
+ NULL, NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &skipscan_skipsupport_enabled,
+ true,
+ NULL, NULL, NULL
+ },
+
{
{"integer_datetimes", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows whether datetimes are integer based."),
@@ -3525,6 +3537,17 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ /* XXX Remove before commit */
+ {
+ {"skipscan_prefix_cols", PGC_SUSET, DEVELOPER_OPTIONS,
+ NULL, NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &skipscan_prefix_cols,
+ INDEX_MAX_KEYS, 0, INDEX_MAX_KEYS,
+ NULL, NULL, NULL
+ },
+
{
/* Can't be set in postgresql.conf */
{"server_version_num", PGC_INTERNAL, PRESET_OPTIONS,
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 2b3997988..9662fb2ba 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -583,6 +583,19 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
</para>
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><function>skipsupport</function></term>
+ <listitem>
+ <para>
+ Optionally, a btree operator family may provide a <firstterm>skip
+ support</firstterm> function, registered under support function
+ number 6. These functions allow the B-tree code to more efficiently
+ navigate the index structure via an index <quote>skip scan</quote>. The
+ APIs involved in this are defined in
+ <filename>src/include/utils/skipsupport.h</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</sect2>
diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index 22d8ad1aa..f17dd3456 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -461,6 +461,13 @@
</entry>
<entry>5</entry>
</row>
+ <row>
+ <entry>
+ Return the addresses of C-callable skip support function(s)
+ (optional)
+ </entry>
+ <entry>6</entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -1056,7 +1063,8 @@ DEFAULT FOR TYPE int8 USING btree FAMILY integer_ops AS
FUNCTION 1 btint8cmp(int8, int8) ,
FUNCTION 2 btint8sortsupport(internal) ,
FUNCTION 3 in_range(int8, int8, int8, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint8skipsupport(internal);
CREATE OPERATOR CLASS int4_ops
DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
@@ -1069,7 +1077,8 @@ DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
FUNCTION 1 btint4cmp(int4, int4) ,
FUNCTION 2 btint4sortsupport(internal) ,
FUNCTION 3 in_range(int4, int4, int4, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint4skipsupport(internal);
CREATE OPERATOR CLASS int2_ops
DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
@@ -1082,7 +1091,8 @@ DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
FUNCTION 1 btint2cmp(int2, int2) ,
FUNCTION 2 btint2sortsupport(internal) ,
FUNCTION 3 in_range(int2, int2, int2, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint2skipsupport(internal);
ALTER OPERATOR FAMILY integer_ops USING btree ADD
-- cross-type comparisons int8 vs int2
diff --git a/src/test/regress/expected/alter_generic.out b/src/test/regress/expected/alter_generic.out
index ae54cb254..8b6b775c1 100644
--- a/src/test/regress/expected/alter_generic.out
+++ b/src/test/regress/expected/alter_generic.out
@@ -362,9 +362,9 @@ ERROR: invalid operator number 0, must be between 1 and 5
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 1 < ; -- operator without argument types
ERROR: operator argument types must be specified in ALTER OPERATOR FAMILY
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 0 btint42cmp(int4, int2); -- invalid options parsing function
-ERROR: invalid function number 0, must be between 1 and 5
-ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 6 btint42cmp(int4, int2); -- function number should be between 1 and 5
-ERROR: invalid function number 6, must be between 1 and 5
+ERROR: invalid function number 0, must be between 1 and 6
+ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 7 btint42cmp(int4, int2); -- function number should be between 1 and 6
+ERROR: invalid function number 7, must be between 1 and 6
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD STORAGE invalid_storage; -- Ensure STORAGE is not a part of ALTER OPERATOR FAMILY
ERROR: STORAGE cannot be specified in ALTER OPERATOR FAMILY
DROP OPERATOR FAMILY alt_opf4 USING btree;
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 3bbe4c5f9..a8d5be6c1 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5138,9 +5138,10 @@ List of access methods
btree | uuid_ops | uuid | uuid | 1 | uuid_cmp
btree | uuid_ops | uuid | uuid | 2 | uuid_sortsupport
btree | uuid_ops | uuid | uuid | 4 | btequalimage
+ btree | uuid_ops | uuid | uuid | 6 | uuid_skipsupport
hash | uuid_ops | uuid | uuid | 1 | uuid_hash
hash | uuid_ops | uuid | uuid | 2 | uuid_hash_extended
-(5 rows)
+(6 rows)
-- check \dconfig
set work_mem = 10240;
diff --git a/src/test/regress/sql/alter_generic.sql b/src/test/regress/sql/alter_generic.sql
index de58d268d..4246afefd 100644
--- a/src/test/regress/sql/alter_generic.sql
+++ b/src/test/regress/sql/alter_generic.sql
@@ -310,7 +310,7 @@ ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 6 < (int4, int2); -- ope
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 0 < (int4, int2); -- operator number should be between 1 and 5
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 1 < ; -- operator without argument types
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 0 btint42cmp(int4, int2); -- invalid options parsing function
-ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 6 btint42cmp(int4, int2); -- function number should be between 1 and 5
+ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 7 btint42cmp(int4, int2); -- function number should be between 1 and 6
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD STORAGE invalid_storage; -- Ensure STORAGE is not a part of ALTER OPERATOR FAMILY
DROP OPERATOR FAMILY alt_opf4 USING btree;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 635e6d6e2..58dec6a16 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -218,6 +218,7 @@ BTScanPos
BTScanPosData
BTScanPosItem
BTShared
+BTSkipPreproc
BTSortArrayContext
BTSpool
BTStack
@@ -2653,6 +2654,8 @@ SingleBoundSortItem
SinglePartitionSpec
Size
SkipPages
+SkipSupport
+SkipSupportData
SlabBlock
SlabContext
SlabSlot
--
2.45.2
On Mon, Jul 15, 2024 at 2:34 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v3, which generalizes skip scan, allowing it to work with
> opclasses/types that lack a skip support routine. In other words, v3
> makes skip scan work for all types, including continuous types, where
> it's impractical or infeasible to add skip support.
Attached is v4, which:
* Fixes a previous FIXME item affecting range skip scans/skip arrays
used in cross-type scenarios.
* Refactors and simplifies the handling of range inequalities
associated with skip arrays more generally. We now always use
inequality scan keys during array advancement (and when descending the
tree within _bt_first), rather than trying to use a datum taken from
the range inequality as an array element directly.
This gives us cleaner separation between scan keys/data types in
cross-type scenarios: skip arrays will now only ever contain
"elements" of opclass input type. Sentinel values such as -inf are
expanded to represent "the lowest possible value that comes after the
array's low_compare lower bound, if any". Opclasses that don't offer
skip support took roughly this same approach within v3, but in v4 all
opclasses do it the same way (so opclasses with skip support use the
SK_BT_NEG_INF sentinel marking in their scan keys, though never the
SK_BT_NEXTKEY sentinel marking).
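
As a rough illustration of the -inf expansion described above (a toy
model, not code from the patch), a sorted int array can stand in for the
skipped column and a wider integer bound can stand in for a cross-type
low_compare key. The names ToyLowCompare and toy_neg_inf_position are
hypothetical. The point of the sketch is that the -inf element is resolved
by evaluating the inequality against indexed values, so the bound never has
to be converted into an element of the indexed type:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a skip array's low_compare inequality. */
typedef struct ToyLowCompare
{
    bool    present;    /* is there any > or >= lower bound at all? */
    bool    inclusive;  /* true for >=, false for > */
    int64_t bound;      /* deliberately wider than the indexed int type */
} ToyLowCompare;

/*
 * Resolve a -inf element to the first position in a sorted array whose
 * value satisfies the lower bound.  With no lower bound, -inf simply means
 * "start of the index".  Returns n when no value satisfies the bound.
 */
static size_t
toy_neg_inf_position(const int *vals, size_t n, const ToyLowCompare *low)
{
    size_t lo = 0;
    size_t hi = n;

    if (!low->present)
        return 0;

    while (lo < hi)
    {
        size_t  mid = lo + (hi - lo) / 2;
        int64_t v = vals[mid];  /* widen the indexed value, not the bound */
        bool    satisfied = low->inclusive ? (v >= low->bound)
                                           : (v > low->bound);

        if (satisfied)
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;
}
```

A bound that lies outside the range of the indexed type (such as
5000000000 against int values) is handled naturally here, which is the kind
of cross-type case where using the bound as an array element directly
would be awkward.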
This is really just a refactoring revision. Nothing particularly
exciting here compared to v3.
--
Peter Geoghegan
Attachments:
v4-0001-Add-skip-scan-to-nbtree.patch (application/octet-stream)
From 3773fec62437d0f9a55d0484072b926acbfba001 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 16 Apr 2024 13:21:36 -0400
Subject: [PATCH v4] Add skip scan to nbtree.
Skip scan allows nbtree index scans to efficiently use a composite index
on (a, b) for queries with a predicate such as "WHERE b = 5".
This is useful in cases where the total number of distinct values in the
column 'a' is reasonably small (think hundreds, possibly thousands).
In effect, a skip scan treats the composite index on (a, b) as if it was
a series of disjunct subindexes -- one subindex per distinct 'a' value.
We exhaustively "search every subindex" using a qual that behaves like
"WHERE a = ANY(<every possible 'a' value>) AND b = 5".
The design of skip scan works by extending the design for arrays
established by commit 5bf748b8. "Skip arrays" generate their array
values procedurally and on-demand, but otherwise work just like arrays
used by SAOPs.
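
The "search every subindex" idea can be pictured with a toy model (a
sketch only, not code from the patch): a sorted array of (a, b) pairs
stands in for the composite index, and binary search stands in for
descending the B-Tree. All names here (IndexEntry, toy_skip_scan, and the
helpers) are hypothetical. Each distinct 'a' value is discovered on demand
from the data itself, in the spirit of a skip array generating its elements
procedurally:

```c
#include <assert.h>
#include <stddef.h>

/* One entry in a toy "composite index" on (a, b), kept sorted by (a, b). */
typedef struct IndexEntry
{
    int a;
    int b;
} IndexEntry;

/* First position in [lo, n) whose entry is >= (a, b) in index order. */
static size_t
toy_lower_bound(const IndexEntry *idx, size_t lo, size_t n, int a, int b)
{
    size_t hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (idx[mid].a < a || (idx[mid].a == a && idx[mid].b < b))
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* First position in [lo, n) whose 'a' value is strictly greater than a. */
static size_t
toy_next_distinct_a(const IndexEntry *idx, size_t lo, size_t n, int a)
{
    size_t hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (idx[mid].a <= a)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/*
 * Count entries with b = target_b, probing once per distinct 'a' value
 * instead of reading every entry.  *nsubindexes reports how many distinct
 * 'a' values (subindexes) were visited.
 */
static size_t
toy_skip_scan(const IndexEntry *idx, size_t n, int target_b,
              size_t *nsubindexes)
{
    size_t pos = 0;
    size_t matches = 0;

    *nsubindexes = 0;
    while (pos < n)
    {
        int cur_a = idx[pos].a;

        (*nsubindexes)++;

        /* Jump straight to (cur_a, target_b) within this subindex */
        pos = toy_lower_bound(idx, pos, n, cur_a, target_b);
        while (pos < n && idx[pos].a == cur_a && idx[pos].b == target_b)
        {
            matches++;
            pos++;
        }

        /* Skip whatever remains of this subindex */
        pos = toy_next_distinct_a(idx, pos, n, cur_a);
    }
    return matches;
}
```

The cost of this loop grows with the number of distinct 'a' values rather
than with the total number of entries, which is why the technique is most
attractive when 'a' has few distinct values.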
B-Tree operator classes on discrete types can now optionally provide a
skip support routine. This is used to generate the next array element
value by incrementing the current value (or by decrementing, in the case
of backwards scans). When the opclass lacks a skip support routine, we
use sentinel next-key values instead. Adding skip support makes skip
scans more efficient in cases where there is naturally a good chance
that the very next value will find matching tuples. For example, during
an index scan with a leading "sales_date" attribute, there is a decent
chance that a scan that just finished returning tuples matching
"sales_date = '2024-06-01' and id = 5000" will find later tuples
matching "sales_date = '2024-06-02' and id = 5000". It is to our
advantage to skip straight to the relevant "id = 5000" leaf page,
totally avoiding reading earlier "sales_date = '2024-06-02'" leaf pages.
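
The difference between the two ways of producing the next array element
can be sketched as follows. This is a standalone toy, not the patch's
nbtutils.c logic: ToySkipElem and toy_advance are hypothetical names, and
the next_key flag only loosely models the SK_BT_NEXTKEY marking. The
increment callback follows the same overflow convention as the int4
routine added by this patch:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Toy skip array element: a value, plus an optional "next key" marking. */
typedef struct ToySkipElem
{
    int  value;     /* current array element */
    bool next_key;  /* true: read as "the smallest value > value" */
} ToySkipElem;

/* Increment callback: reports overflow instead of wrapping around. */
static int
toy_int_increment(int existing, bool *overflow)
{
    if (existing == INT_MAX)
    {
        *overflow = true;
        return 0;           /* return value is undefined on overflow */
    }

    *overflow = false;
    return existing + 1;
}

/*
 * Advance the element for a forward scan.  Returns false once the array is
 * exhausted (only detectable up front when skip support is available).
 */
static bool
toy_advance(ToySkipElem *elem, bool have_skip_support)
{
    bool overflow;
    int  next;

    if (!have_skip_support)
    {
        /* Fallback: keep the value, reinterpret it as value + infinitesimal */
        elem->next_key = true;
        return true;
    }

    next = toy_int_increment(elem->value, &overflow);
    if (overflow)
        return false;

    /* Optimistically assume that the very next value exists in the index */
    elem->value = next;
    elem->next_key = false;
    return true;
}
```

With skip support the next probe can target an exact value, such as the
next sales_date combined with "id = 5000". With the fallback, the scan can
only ask for whatever value comes after the current one, and it learns the
real value once it lands on a leaf page.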
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com>
Reviewed-By: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
---
src/include/access/nbtree.h | 27 +-
src/include/catalog/pg_amproc.dat | 16 +
src/include/catalog/pg_proc.dat | 24 +
src/include/utils/skipsupport.h | 107 ++
src/backend/access/nbtree/nbtcompare.c | 261 +++
src/backend/access/nbtree/nbtree.c | 10 +-
src/backend/access/nbtree/nbtsearch.c | 111 +-
src/backend/access/nbtree/nbtutils.c | 1595 ++++++++++++++++---
src/backend/access/nbtree/nbtvalidate.c | 4 +
src/backend/commands/opclasscmds.c | 25 +
src/backend/utils/adt/Makefile | 1 +
src/backend/utils/adt/date.c | 44 +
src/backend/utils/adt/meson.build | 1 +
src/backend/utils/adt/selfuncs.c | 30 +-
src/backend/utils/adt/skipsupport.c | 52 +
src/backend/utils/adt/uuid.c | 67 +
src/backend/utils/misc/guc_tables.c | 23 +
doc/src/sgml/btree.sgml | 13 +
doc/src/sgml/xindex.sgml | 16 +-
src/test/regress/expected/alter_generic.out | 6 +-
src/test/regress/expected/psql.out | 3 +-
src/test/regress/sql/alter_generic.sql | 2 +-
src/tools/pgindent/typedefs.list | 3 +
23 files changed, 2237 insertions(+), 204 deletions(-)
create mode 100644 src/include/utils/skipsupport.h
create mode 100644 src/backend/utils/adt/skipsupport.c
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 749304334..945091021 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -24,6 +24,7 @@
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/shm_toc.h"
+#include "utils/skipsupport.h"
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
typedef uint16 BTCycleId;
@@ -709,7 +710,8 @@ BTreeTupleGetMaxHeapTID(IndexTuple itup)
#define BTINRANGE_PROC 3
#define BTEQUALIMAGE_PROC 4
#define BTOPTIONS_PROC 5
-#define BTNProcs 5
+#define BTSKIPSUPPORT_PROC 6
+#define BTNProcs 6
/*
* We need to be able to tell the difference between read and write
@@ -1031,10 +1033,22 @@ typedef BTScanPosData *BTScanPos;
/* We need one of these for each equality-type SK_SEARCHARRAY scan key */
typedef struct BTArrayKeyInfo
{
+ /* fields used by both kinds of array (standard arrays and skip arrays) */
int scan_key; /* index of associated key in keyData */
+ int num_elems; /* number of elems (-1 for skip array) */
+
+ /* fields for standard arrays that store elements in memory */
int cur_elem; /* index of current element in elem_values */
- int num_elems; /* number of elems in current array value */
Datum *elem_values; /* array of num_elems Datums */
+
+ /* fields for skip arrays, which generate their elements procedurally */
+ bool use_sksup; /* sksup set to valid routine? */
+ bool null_elem; /* lowest/highest element actually NULL? */
+ SkipSupportData sksup; /* opclass skip scan support, when use_sksup */
+ ScanKey low_compare; /* array's > or >= lower bound */
+ ScanKey high_compare; /* array's < or <= upper bound */
+ FmgrInfo order_low; /* low_compare's ORDER proc */
+ FmgrInfo order_high; /* high_compare's ORDER proc */
} BTArrayKeyInfo;
typedef struct BTScanOpaqueData
@@ -1123,6 +1137,11 @@ typedef struct BTReadPageState
*/
#define SK_BT_REQFWD 0x00010000 /* required to continue forward scan */
#define SK_BT_REQBKWD 0x00020000 /* required to continue backward scan */
+#define SK_BT_SKIP 0x00040000 /* skip array, for skip scan */
+#define SK_BT_NEG_INF 0x00080000 /* -inf skip array element in sk_argument */
+#define SK_BT_POS_INF 0x00100000 /* +inf skip array element in sk_argument */
+#define SK_BT_NEXTKEY 0x00200000 /* interpret sk_argument as +infinitesimal */
+#define SK_BT_PREVKEY 0x00400000 /* interpret sk_argument as -infinitesimal */
#define SK_BT_INDOPTION_SHIFT 24 /* must clear the above bits */
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
@@ -1159,6 +1178,10 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+/* GUC parameters (just a temporary convenience for reviewers) */
+extern PGDLLIMPORT int skipscan_prefix_cols;
+extern PGDLLIMPORT bool skipscan_skipsupport_enabled;
+
/*
* external entry points for btree, in nbtree.c
*/
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index f639c3a6a..2a8f6f3f1 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -21,6 +21,8 @@
amprocrighttype => 'bit', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
amprocrighttype => 'bool', amprocnum => '1', amproc => 'btboolcmp' },
+{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
+ amprocrighttype => 'bool', amprocnum => '6', amproc => 'btboolskipsupport' },
{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
amprocrighttype => 'bool', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/bpchar_ops', amproclefttype => 'bpchar',
@@ -41,12 +43,16 @@
amprocrighttype => 'char', amprocnum => '1', amproc => 'btcharcmp' },
{ amprocfamily => 'btree/char_ops', amproclefttype => 'char',
amprocrighttype => 'char', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/char_ops', amproclefttype => 'char',
+ amprocrighttype => 'char', amprocnum => '6', amproc => 'btcharskipsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '1', amproc => 'date_cmp' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '2', amproc => 'date_sortsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
+ amprocrighttype => 'date', amprocnum => '6', amproc => 'date_skipsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'timestamp', amprocnum => '1',
amproc => 'date_cmp_timestamp' },
@@ -122,6 +128,8 @@
amprocrighttype => 'int2', amprocnum => '2', amproc => 'btint2sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
+ amprocrighttype => 'int2', amprocnum => '6', amproc => 'btint2skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint24cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
@@ -141,6 +149,8 @@
amprocrighttype => 'int4', amprocnum => '2', amproc => 'btint4sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
+ amprocrighttype => 'int4', amprocnum => '6', amproc => 'btint4skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
amprocrighttype => 'int8', amprocnum => '1', amproc => 'btint48cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
@@ -160,6 +170,8 @@
amprocrighttype => 'int8', amprocnum => '2', amproc => 'btint8sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
amprocrighttype => 'int8', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
+ amprocrighttype => 'int8', amprocnum => '6', amproc => 'btint8skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint84cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
@@ -193,6 +205,8 @@
amprocrighttype => 'oid', amprocnum => '2', amproc => 'btoidsortsupport' },
{ amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
amprocrighttype => 'oid', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
+ amprocrighttype => 'oid', amprocnum => '6', amproc => 'btoidskipsupport' },
{ amprocfamily => 'btree/oidvector_ops', amproclefttype => 'oidvector',
amprocrighttype => 'oidvector', amprocnum => '1',
amproc => 'btoidvectorcmp' },
@@ -261,6 +275,8 @@
amprocrighttype => 'uuid', amprocnum => '2', amproc => 'uuid_sortsupport' },
{ amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
amprocrighttype => 'uuid', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
+ amprocrighttype => 'uuid', amprocnum => '6', amproc => 'uuid_skipsupport' },
{ amprocfamily => 'btree/record_ops', amproclefttype => 'record',
amprocrighttype => 'record', amprocnum => '1', amproc => 'btrecordcmp' },
{ amprocfamily => 'btree/record_image_ops', amproclefttype => 'record',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 73d9cf858..27921e0df 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1004,18 +1004,27 @@
{ oid => '3129', descr => 'sort support',
proname => 'btint2sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint2sortsupport' },
+{ oid => '9290', descr => 'skip support',
+ proname => 'btint2skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint2skipsupport' },
{ oid => '351', descr => 'less-equal-greater',
proname => 'btint4cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'int4 int4', prosrc => 'btint4cmp' },
{ oid => '3130', descr => 'sort support',
proname => 'btint4sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint4sortsupport' },
+{ oid => '9291', descr => 'skip support',
+ proname => 'btint4skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint4skipsupport' },
{ oid => '842', descr => 'less-equal-greater',
proname => 'btint8cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'int8 int8', prosrc => 'btint8cmp' },
{ oid => '3131', descr => 'sort support',
proname => 'btint8sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint8sortsupport' },
+{ oid => '9292', descr => 'skip support',
+ proname => 'btint8skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint8skipsupport' },
{ oid => '354', descr => 'less-equal-greater',
proname => 'btfloat4cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'float4 float4', prosrc => 'btfloat4cmp' },
@@ -1034,12 +1043,18 @@
{ oid => '3134', descr => 'sort support',
proname => 'btoidsortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btoidsortsupport' },
+{ oid => '9293', descr => 'skip support',
+ proname => 'btoidskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btoidskipsupport' },
{ oid => '404', descr => 'less-equal-greater',
proname => 'btoidvectorcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'oidvector oidvector', prosrc => 'btoidvectorcmp' },
{ oid => '358', descr => 'less-equal-greater',
proname => 'btcharcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'char char', prosrc => 'btcharcmp' },
+{ oid => '9294', descr => 'skip support',
+ proname => 'btcharskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btcharskipsupport' },
{ oid => '359', descr => 'less-equal-greater',
proname => 'btnamecmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'name name', prosrc => 'btnamecmp' },
@@ -2214,6 +2229,9 @@
{ oid => '3136', descr => 'sort support',
proname => 'date_sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'date_sortsupport' },
+{ oid => '9295', descr => 'skip support',
+ proname => 'date_skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'date_skipsupport' },
{ oid => '4133', descr => 'window RANGE support',
proname => 'in_range', prorettype => 'bool',
proargtypes => 'date date interval bool bool',
@@ -4368,6 +4386,9 @@
{ oid => '1693', descr => 'less-equal-greater',
proname => 'btboolcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'bool bool', prosrc => 'btboolcmp' },
+{ oid => '9296', descr => 'skip support',
+ proname => 'btboolskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btboolskipsupport' },
{ oid => '1688', descr => 'hash',
proname => 'time_hash', prorettype => 'int4', proargtypes => 'time',
@@ -9192,6 +9213,9 @@
{ oid => '3300', descr => 'sort support',
proname => 'uuid_sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'uuid_sortsupport' },
+{ oid => '9297', descr => 'skip support',
+ proname => 'uuid_skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'uuid_skipsupport' },
{ oid => '2961', descr => 'I/O',
proname => 'uuid_recv', prorettype => 'uuid', proargtypes => 'internal',
prosrc => 'uuid_recv' },
diff --git a/src/include/utils/skipsupport.h b/src/include/utils/skipsupport.h
new file mode 100644
index 000000000..3d76c66b3
--- /dev/null
+++ b/src/include/utils/skipsupport.h
@@ -0,0 +1,107 @@
+/*-------------------------------------------------------------------------
+ *
+ * skipsupport.h
+ * Support routines for B-Tree skip scan.
+ *
+ * B-Tree operator classes for discrete types can optionally provide a support
+ * function for skipping. This is used during skip scans.
+ *
+ * A B-tree operator class that implements skip support provides B-tree index
+ * scans with a way of enumerating and iterating through every possible value
+ * from the domain of indexable values. This gives scans a way to determine
+ * the next value in line for a given skip array/scan key/skipped attribute.
+ * This happens at the point where the scan determines that another primitive
+ * index scan is required. The next value is used (in combination with at
+ * least one additional lower-order non-skip key, taken from the SQL query) to
+ * relocate the scan, skipping over many irrelevant leaf pages in the process.
+ *
+ * Skip support generally works best with discrete types such as integer,
+ * date, and boolean; types where there is a decent chance that indexes will
+ * contain contiguous values (given a leading attribute using the opclass).
+ * When gaps/discontinuities are naturally rare (e.g., a leading identity
+ * column in a composite index, a date column preceding a product_id column),
+ * then it makes sense for skip scans to optimistically assume that the next
+ * distinct indexable value will find directly matching index tuples.
+ *
+ * The B-Tree code can fall back on next-key sentinel values for any opclass
+ * that doesn't provide its own skip support function. There is no point in
+ * providing skip support unless the next indexed key value is often the next
+ * indexable value (at least with some workloads). Opclasses where that never
+ * works out in practice should just rely on the B-Tree AM's generic next-key
+ * fallback strategy. Opclasses where adding skip support is infeasible or
+ * hard (e.g., an opclass for a continuous type) can also use the fallback.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/skipsupport.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SKIPSUPPORT_H
+#define SKIPSUPPORT_H
+
+#include "utils/relcache.h"
+
+typedef struct SkipSupportData *SkipSupport;
+
+/*
+ * State/callbacks used by skip arrays to procedurally generate elements.
+ *
+ * A BTSKIPSUPPORT_PROC function must set each and every field when called.
+ * If an opclass can only set some of the fields, then it cannot safely
+ * provide a skip support routine.
+ */
+typedef struct SkipSupportData
+{
+ /*
+ * low_elem and high_elem must be set with the lowest and highest possible
+ * values from the domain of indexable values (assuming standard ascending
+ * order). This helps the B-Tree code with finding its initial position
+ * at the leaf level (during the skip scan's first primitive index scan).
+ * In other words, it gives the B-Tree code a useful value to start from,
+ * before any data has been read from the index.
+ *
+ * low_elem and high_elem are also used by skip scans to determine when
+ * they've reached the final possible value (in the current direction).
+ * It's typical for the scan to run out of leaf pages before it runs out
+ * of unscanned indexable values, but it's still useful for the scan to
+ * have a way to recognize when it has reached the last possible value
+ * (this saves us a useless probe that just lands on the final leaf page).
+ */
+ Datum low_elem; /* lowest sorting/leftmost non-NULL value */
+ Datum high_elem; /* highest sorting/rightmost non-NULL value */
+
+ /*
+ * Decrement/increment functions.
+ *
+ * Returns a decremented/incremented copy of caller's existing datum,
+ * allocated in caller's memory context (in the case of pass-by-reference
+ * types). It's not okay for these functions to leak any memory.
+ *
+ * Both decrement and increment callbacks are guaranteed to never be
+ * called with a NULL "existing" arg.
+ *
+ * When the decrement function (or increment function) is called with a
+ * value that already matches low_elem (or high_elem), function must set
+ * *underflow (or set *overflow). The return value is undefined when this
+ * happens. Opclass must not allocate memory for the undefined returned
+ * value, since the B-Tree code isn't required to free the memory.
+ *
+ * The B-Tree skip scan caller's "existing" datum is often just a straight
+ * copy of a value from an index tuple. Operator classes must be liberal
+ * in accepting every possible representational variation within the
+ * underlying data type. On the other hand, opclasses are _not_ expected
+ * to preserve any information that doesn't affect how datums are sorted
+ * (e.g., skip support for a fixed precision numeric type isn't required
+ * to preserve datum display scale).
+ */
+ Datum (*decrement) (Relation rel, Datum existing, bool *underflow);
+ Datum (*increment) (Relation rel, Datum existing, bool *overflow);
+} SkipSupportData;
+
+extern bool PrepareSkipSupportFromOpclass(Oid opfamily, Oid opcintype,
+ bool reverse, SkipSupport sksup);
+
+#endif /* SKIPSUPPORT_H */
diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c
index 1c72867c8..deb387453 100644
--- a/src/backend/access/nbtree/nbtcompare.c
+++ b/src/backend/access/nbtree/nbtcompare.c
@@ -58,6 +58,7 @@
#include <limits.h>
#include "utils/fmgrprotos.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
#ifdef STRESS_SORT_INT_MIN
@@ -78,6 +79,49 @@ btboolcmp(PG_FUNCTION_ARGS)
PG_RETURN_INT32((int32) a - (int32) b);
}
+static Datum
+bool_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ bool bexisting = DatumGetBool(existing);
+
+ if (bexisting == false)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return BoolGetDatum(bexisting - 1);
+}
+
+static Datum
+bool_increment(Relation rel, Datum existing, bool *overflow)
+{
+ bool bexisting = DatumGetBool(existing);
+
+ if (bexisting == true)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return BoolGetDatum(bexisting + 1);
+}
+
+Datum
+btboolskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = bool_decrement;
+ sksup->increment = bool_increment;
+ sksup->low_elem = BoolGetDatum(false);
+ sksup->high_elem = BoolGetDatum(true);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint2cmp(PG_FUNCTION_ARGS)
{
@@ -105,6 +149,49 @@ btint2sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int2_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ int16 iexisting = DatumGetInt16(existing);
+
+ if (iexisting == PG_INT16_MIN)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return Int16GetDatum(iexisting - 1);
+}
+
+static Datum
+int2_increment(Relation rel, Datum existing, bool *overflow)
+{
+ int16 iexisting = DatumGetInt16(existing);
+
+ if (iexisting == PG_INT16_MAX)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return Int16GetDatum(iexisting + 1);
+}
+
+Datum
+btint2skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int2_decrement;
+ sksup->increment = int2_increment;
+ sksup->low_elem = Int16GetDatum(PG_INT16_MIN);
+ sksup->high_elem = Int16GetDatum(PG_INT16_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint4cmp(PG_FUNCTION_ARGS)
{
@@ -128,6 +215,49 @@ btint4sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int4_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ int32 iexisting = DatumGetInt32(existing);
+
+ if (iexisting == PG_INT32_MIN)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return Int32GetDatum(iexisting - 1);
+}
+
+static Datum
+int4_increment(Relation rel, Datum existing, bool *overflow)
+{
+ int32 iexisting = DatumGetInt32(existing);
+
+ if (iexisting == PG_INT32_MAX)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return Int32GetDatum(iexisting + 1);
+}
+
+Datum
+btint4skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int4_decrement;
+ sksup->increment = int4_increment;
+ sksup->low_elem = Int32GetDatum(PG_INT32_MIN);
+ sksup->high_elem = Int32GetDatum(PG_INT32_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint8cmp(PG_FUNCTION_ARGS)
{
@@ -171,6 +301,49 @@ btint8sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int8_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ int64 iexisting = DatumGetInt64(existing);
+
+ if (iexisting == PG_INT64_MIN)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return Int64GetDatum(iexisting - 1);
+}
+
+static Datum
+int8_increment(Relation rel, Datum existing, bool *overflow)
+{
+ int64 iexisting = DatumGetInt64(existing);
+
+ if (iexisting == PG_INT64_MAX)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return Int64GetDatum(iexisting + 1);
+}
+
+Datum
+btint8skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int8_decrement;
+ sksup->increment = int8_increment;
+ sksup->low_elem = Int64GetDatum(PG_INT64_MIN);
+ sksup->high_elem = Int64GetDatum(PG_INT64_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint48cmp(PG_FUNCTION_ARGS)
{
@@ -292,6 +465,49 @@ btoidsortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+oid_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ Oid oexisting = DatumGetObjectId(existing);
+
+ if (oexisting == InvalidOid)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return ObjectIdGetDatum(oexisting - 1);
+}
+
+static Datum
+oid_increment(Relation rel, Datum existing, bool *overflow)
+{
+ Oid oexisting = DatumGetObjectId(existing);
+
+ if (oexisting == OID_MAX)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return ObjectIdGetDatum(oexisting + 1);
+}
+
+Datum
+btoidskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = oid_decrement;
+ sksup->increment = oid_increment;
+ sksup->low_elem = ObjectIdGetDatum(InvalidOid);
+ sksup->high_elem = ObjectIdGetDatum(OID_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btoidvectorcmp(PG_FUNCTION_ARGS)
{
@@ -325,3 +541,48 @@ btcharcmp(PG_FUNCTION_ARGS)
/* Be careful to compare chars as unsigned */
PG_RETURN_INT32((int32) ((uint8) a) - (int32) ((uint8) b));
}
+
+static Datum
+char_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ uint8 cexisting = DatumGetUInt8(existing);
+
+ if (cexisting == 0)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return CharGetDatum((uint8) cexisting - 1);
+}
+
+static Datum
+char_increment(Relation rel, Datum existing, bool *overflow)
+{
+ uint8 cexisting = DatumGetUInt8(existing);
+
+ if (cexisting == UCHAR_MAX)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return CharGetDatum((uint8) cexisting + 1);
+}
+
+Datum
+btcharskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = char_decrement;
+ sksup->increment = char_increment;
+
+ /* btcharcmp compares chars as unsigned */
+ sksup->low_elem = UInt8GetDatum(0);
+ sksup->high_elem = UInt8GetDatum(UCHAR_MAX);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 686a3206f..9c9cd48f7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -324,11 +324,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
BTScanPosInvalidate(so->currPos);
BTScanPosInvalidate(so->markPos);
- if (scan->numberOfKeys > 0)
- so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
- else
- so->keyData = NULL;
+ so->keyData = NULL;
so->needPrimScan = false;
so->scanBehind = false;
so->arrayKeys = NULL;
@@ -408,6 +405,11 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
scan->numberOfKeys * sizeof(ScanKeyData));
so->numberOfKeys = 0; /* until _bt_preprocess_keys sets it */
so->numArrayKeys = 0; /* ditto */
+
+ /* Release private storage allocated in previous btrescan, if any */
+ if (so->keyData != NULL)
+ pfree(so->keyData);
+ so->keyData = NULL;
}
/*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 57bcfc7e4..a78b69f88 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -880,7 +880,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Buffer buf;
BTStack stack;
OffsetNumber offnum;
- StrategyNumber strat;
BTScanInsertData inskey;
ScanKey startKeys[INDEX_MAX_KEYS];
ScanKeyData notnullkeys[INDEX_MAX_KEYS];
@@ -1022,6 +1021,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
ScanKey chosen;
ScanKey impliesNN;
ScanKey cur;
+ int ikey = 0,
+ ichosen = 0;
/*
* chosen is the so-far-chosen key for the current attribute, if any.
@@ -1042,6 +1043,80 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
{
if (i >= so->numberOfKeys || cur->sk_attno != curattr)
{
+ /*
+ * Conceptually, skip arrays consist of array elements whose
+ * values are generated procedurally and on demand. We need
+ * special handling for that here.
+ *
+ * We must interpret various sentinel values to generate an
+ * insertion scan key. This is only actually needed for index
+ * attributes whose input opclass lacks a skip support routine
+ * (when skip support is available we'll always be able to
+ * generate true array element datum values instead).
+ */
+ if (chosen && chosen->sk_flags & (SK_BT_NEG_INF | SK_BT_POS_INF))
+ {
+ BTArrayKeyInfo *array = NULL;
+
+ Assert(chosen->sk_flags & SK_BT_SKIP);
+ Assert(!(chosen->sk_flags & (SK_BT_NEXTKEY | SK_BT_PREVKEY)));
+
+ for (; ikey < so->numArrayKeys; ikey++)
+ {
+ array = &so->arrayKeys[ikey];
+ if (array->scan_key == ichosen)
+ break;
+ }
+
+ Assert(array->scan_key == ichosen);
+ Assert(array->num_elems == -1);
+
+ if (array->null_elem)
+ {
+ /*
+ * Treat the chosen scan key as having the value -inf
+ * (or the value +inf, in the backwards scan case) by
+ * not appending it to the local startKeys[] array.
+ */
+ Assert(!array->low_compare);
+ Assert(!array->high_compare);
+ break; /* done adding entries to startKeys[] */
+ }
+ else if ((chosen->sk_flags & SK_BT_NEG_INF) &&
+ array->low_compare)
+ {
+ Assert(ScanDirectionIsForward(dir));
+
+ /* use array's inequality key in startKeys[] */
+ chosen = array->low_compare;
+ }
+ else if ((chosen->sk_flags & SK_BT_POS_INF) &&
+ array->high_compare)
+ {
+ Assert(ScanDirectionIsBackward(dir));
+
+ /* use array's inequality key in startKeys[] */
+ chosen = array->high_compare;
+ }
+ else
+ {
+ /*
+ * Array doesn't have any explicit low_compare or
+ * high_compare that we can use (given the current
+ * scan direction). The array does not include a NULL
+ * element (to generate an IS NULL qual), though, so
+ * we might need to deduce a NOT NULL key to skip over
+ * any NULLs. Prepare for that.
+ *
+ * Note: this is also how we handle an explicit NOT
+ * NULL key that preprocessing folded into the skip
+ * array.
+ */
+ impliesNN = chosen;
+ chosen = NULL;
+ }
+ }
+
/*
* Done looking at keys for curattr. If we didn't find a
* usable boundary key, see if we can deduce a NOT NULL key.
@@ -1075,16 +1150,38 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
break;
startKeys[keysz++] = chosen;
+ /*
+ * Skip arrays can also use a sk_argument which is marked
+ * "next key". This is another sentinel array element value
+ * requiring special handling here by us. As with -inf/+inf
+ * sentinels, there cannot be any exact non-pivot matches.
+ */
+ if (chosen->sk_flags & (SK_BT_NEXTKEY | SK_BT_PREVKEY))
+ {
+ Assert(chosen->sk_flags & SK_BT_SKIP);
+ Assert(!(chosen->sk_flags & (SK_BT_NEG_INF | SK_BT_POS_INF)));
+ Assert(chosen->sk_strategy == BTEqualStrategyNumber);
+
+ /*
+ * Adjust strat_total, so that our = key gets treated like
+ * a > key (or like a < key)
+ */
+ if (chosen->sk_flags & SK_BT_NEXTKEY)
+ strat_total = BTGreaterStrategyNumber;
+ else
+ strat_total = BTLessStrategyNumber;
+ break;
+ }
+
/*
* Adjust strat_total, and quit if we have stored a > or <
* key.
*/
- strat = chosen->sk_strategy;
- if (strat != BTEqualStrategyNumber)
+ if (chosen->sk_strategy != BTEqualStrategyNumber)
{
- strat_total = strat;
- if (strat == BTGreaterStrategyNumber ||
- strat == BTLessStrategyNumber)
+ strat_total = chosen->sk_strategy;
+ if (chosen->sk_strategy == BTGreaterStrategyNumber ||
+ chosen->sk_strategy == BTLessStrategyNumber)
break;
}
@@ -1103,6 +1200,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
curattr = cur->sk_attno;
chosen = NULL;
impliesNN = NULL;
+ ichosen = -1;
}
/*
@@ -1127,6 +1225,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
case BTEqualStrategyNumber:
/* override any non-equality choice */
chosen = cur;
+ ichosen = i;
break;
case BTGreaterEqualStrategyNumber:
case BTGreaterStrategyNumber:
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index d6de2072d..5260a929a 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -29,9 +29,37 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+/*
+ * GUC parameters (temporary convenience for reviewers).
+ *
+ * To disable all skipping, set skipscan_prefix_cols=0. Otherwise set it to
+ * the attribute number that you wish to make the last attribute number that
+ * we can add a skip scan key for. For example, skipscan_prefix_cols=1 makes
+ * an index scan with qual "WHERE b = 1 AND c > 42" generate a skip scan key
+ * on the column 'a' (which is attnum 1) only, preventing us from adding one
+ * for the column 'c' (and so 'c' will still have an inequality scan key,
+ * required in only one direction -- 'c' won't be output as a "range" skip
+ * key/array).
+ */
+int skipscan_prefix_cols = INDEX_MAX_KEYS;
+
+/*
+ * skipscan_skipsupport_enabled can be used to avoid using opclass skip
+ * support routines. This can be used to quantify the performance benefit that
+ * comes from having dedicated skip support, with a given test query.
+ */
+bool skipscan_skipsupport_enabled = true;
+
#define LOOK_AHEAD_REQUIRED_RECHECKS 3
#define LOOK_AHEAD_DEFAULT_DISTANCE 5
+typedef struct BTSkipPreproc
+{
+ SkipSupportData sksup; /* opclass skip scan support (optional) */
+ bool use_sksup; /* sksup set to valid routine? */
+ Oid eq_op; /* InvalidOid means don't skip */
+} BTSkipPreproc;
+
typedef struct BTSortArrayContext
{
FmgrInfo *sortproc;
@@ -62,22 +90,49 @@ static bool _bt_compare_array_scankey_args(IndexScanDesc scan,
ScanKey arraysk, ScanKey skey,
FmgrInfo *orderproc, BTArrayKeyInfo *array,
bool *qual_ok);
-static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan);
+static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys);
static void _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap);
+static int _bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts);
+static bool _bt_skipsupport(Relation rel, int add_skip_attno,
+ BTSkipPreproc *skipatts);
+static inline Datum _bt_skipsupport_decrement(Relation rel, ScanKey skey,
+ BTArrayKeyInfo *array, bool *underflow);
+static inline Datum _bt_skipsupport_increment(Relation rel, ScanKey skey,
+ BTArrayKeyInfo *array, bool *overflow);
static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
static inline int32 _bt_compare_array_skey(FmgrInfo *orderproc,
Datum tupdatum, bool tupnull,
- Datum arrdatum, ScanKey cur);
+ Datum arrdatum, bool arrnull,
+ ScanKey cur);
+static void _bt_array_preproc_shrink(ScanKey arraysk, ScanKey skey,
+ FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok);
+static bool _bt_skip_preproc_shrink(IndexScanDesc scan, ScanKey arraysk,
+ ScanKey skey, FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok);
static int _bt_binsrch_array_skey(FmgrInfo *orderproc,
bool cur_elem_trig, ScanDirection dir,
Datum tupdatum, bool tupnull,
BTArrayKeyInfo *array, ScanKey cur,
int32 *set_elem_result);
+static void _bt_binsrch_skiparray_skey(FmgrInfo *orderproc,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *set_elem_result);
+static void _bt_scankey_set_low_or_high(Relation rel, ScanKey skey,
+ BTArrayKeyInfo *array, bool low_not_high);
+static void _bt_scankey_set_element(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ Datum tupdatum, bool tupnull);
+static void _bt_scankey_unset_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static void _bt_scankey_set_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static bool _bt_scankey_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static bool _bt_scankey_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir);
static void _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir);
static bool _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
IndexTuple tuple, TupleDesc tupdesc, int tupnatts,
- bool readpagetup, int sktrig, bool *scanBehind);
+ bool readpagetup, int sktrig, bool *scanBehind,
+ bool infbefore);
static bool _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
int sktrig, bool sktrig_required);
@@ -251,9 +306,6 @@ _bt_freestack(BTStack stack)
* It is convenient for _bt_preprocess_keys caller to have to deal with no
* more than one equality strategy array scan key per index attribute. We'll
* always be able to set things up that way when complete opfamilies are used.
- * Eliminated array scan keys can be recognized as those that have had their
- * sk_strategy field set to InvalidStrategy here by us. Caller should avoid
- * including these in the scan's so->keyData[] output array.
*
* We set the scan key references from the scan's BTArrayKeyInfo info array to
* offsets into the temp modified input array returned to caller. Scans that
@@ -261,18 +313,36 @@ _bt_freestack(BTStack stack)
* preprocessing steps are complete. This will convert the scan key offset
* references into references to the scan's so->keyData[] output scan keys.
*
+ * We're also responsible for generating skip arrays (and their associated
+ * scan keys) here. This enables skip scan. We do this for index attributes
+ * that initially lacked an equality condition within scan->keyData[], iff
+ * doing so allows a later scan key (that was passed to us in scan->keyData[])
+ * to be marked required by later preprocessing on output.
+ * _bt_decide_skipatts decides which attributes receive skip arrays.
+ *
+ * Caller must pass *numberOfKeys to give us a way to change the number of
+ * input scan keys (our output is caller's input). The returned array can be
+ * smaller than scan->keyData[] when we eliminated a redundant array scan key
+ * (redundant with some other array scan key, for the same attribute). It can
+ * also be larger when we added a skip array/skip scan key. Caller uses this
+ * to allocate so->keyData[] for the current btrescan.
+ *
* Note: the reason we need to return a temp scan key array, rather than just
* scribbling on scan->keyData, is that callers are permitted to call btrescan
* without supplying a new set of scankey data.
*/
static ScanKey
-_bt_preprocess_array_keys(IndexScanDesc scan)
+_bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
- int numberOfKeys = scan->numberOfKeys;
+ int numArrayKeyData = scan->numberOfKeys;
int16 *indoption = rel->rd_indoption;
- int numArrayKeys;
+ BTSkipPreproc skipatts[INDEX_MAX_KEYS];
+ int numArrayKeys,
+ numSkipArrayKeys,
+ output_ikey = 0;
+ AttrNumber attno_skip = 1;
int origarrayatt = InvalidAttrNumber,
origarraykey = -1;
Oid origelemtype = InvalidOid;
@@ -280,11 +350,14 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
MemoryContext oldContext;
ScanKey arrayKeyData; /* modified copy of scan->keyData */
- Assert(numberOfKeys);
+ Assert(scan->numberOfKeys);
- /* Quick check to see if there are any array keys */
+ /*
+ * Quick check to see if there are any array keys, or any missing keys we
+ * can generate a "skip scan" array key for ourselves
+ */
numArrayKeys = 0;
- for (int i = 0; i < numberOfKeys; i++)
+ for (int i = 0; i < scan->numberOfKeys; i++)
{
cur = &scan->keyData[i];
if (cur->sk_flags & SK_SEARCHARRAY)
@@ -300,6 +373,16 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
}
}
+ /* Consider generating skip arrays, and associated equality scan keys */
+ numSkipArrayKeys = _bt_decide_skipatts(scan, skipatts);
+ if (numSkipArrayKeys)
+ {
+ /* At least one skip array scan key must be added to arrayKeyData[] */
+ numArrayKeys += numSkipArrayKeys;
+ /* output scan key buffer allocation needs space for skip scan keys */
+ numArrayKeyData += numSkipArrayKeys;
+ }
+
/* Quit if nothing to do. */
if (numArrayKeys == 0)
return NULL;
@@ -317,19 +400,23 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
oldContext = MemoryContextSwitchTo(so->arrayContext);
- /* Create modifiable copy of scan->keyData in the workspace context */
- arrayKeyData = (ScanKey) palloc(numberOfKeys * sizeof(ScanKeyData));
- memcpy(arrayKeyData, scan->keyData, numberOfKeys * sizeof(ScanKeyData));
+ /* Create output scan keys in the workspace context */
+ arrayKeyData = (ScanKey) palloc(numArrayKeyData * sizeof(ScanKeyData));
/* Allocate space for per-array data in the workspace context */
so->arrayKeys = (BTArrayKeyInfo *) palloc(numArrayKeys * sizeof(BTArrayKeyInfo));
/* Allocate space for ORDER procs used to help _bt_checkkeys */
- so->orderProcs = (FmgrInfo *) palloc(numberOfKeys * sizeof(FmgrInfo));
+ so->orderProcs = (FmgrInfo *) palloc(numArrayKeyData * sizeof(FmgrInfo));
- /* Now process each array key */
+ /*
+ * Process each array key, and generate skip arrays as needed. Also copy
+ * every scan->keyData[] input scan key (whether it's an array or not)
+ * into the arrayKeyData array we'll return to our caller (barring any
+ * array scan keys that we could eliminate early through array merging).
+ */
numArrayKeys = 0;
- for (int i = 0; i < numberOfKeys; i++)
+ for (int input_ikey = 0; input_ikey < scan->numberOfKeys; input_ikey++)
{
FmgrInfo sortproc;
FmgrInfo *sortprocp = &sortproc;
@@ -345,14 +432,88 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
int num_nonnulls;
int j;
- cur = &arrayKeyData[i];
- if (!(cur->sk_flags & SK_SEARCHARRAY))
- continue;
+ /* Create a skip array and scan key where indicated by skipatts */
+ while (numSkipArrayKeys &&
+ attno_skip <= scan->keyData[input_ikey].sk_attno)
+ {
+ Oid opcintype = rel->rd_opcintype[attno_skip - 1];
+ Oid collation = rel->rd_indcollation[attno_skip - 1];
+ Oid eq_op = skipatts[attno_skip - 1].eq_op;
+ RegProcedure cmp_proc;
+
+ if (!OidIsValid(eq_op))
+ {
+ /* won't skip using this attribute */
+ attno_skip++;
+ continue;
+ }
+
+ cmp_proc = get_opcode(eq_op);
+ if (!RegProcedureIsValid(cmp_proc))
+ elog(ERROR, "missing oprcode for skipping equals operator %u", eq_op);
+
+ cur = &arrayKeyData[output_ikey];
+ Assert(attno_skip <= scan->keyData[input_ikey].sk_attno);
+ ScanKeyEntryInitialize(cur,
+ SK_SEARCHARRAY | SK_BT_SKIP, /* flags */
+ attno_skip, /* skipped att number */
+ BTEqualStrategyNumber, /* equality strategy */
+ InvalidOid, /* opclass input subtype */
+ collation, /* index column's collation */
+ cmp_proc, /* equality operator's proc */
+ (Datum) 0); /* constant */
+
+ /* Initialize array fields */
+ so->arrayKeys[numArrayKeys].scan_key = output_ikey;
+ so->arrayKeys[numArrayKeys].num_elems = -1;
+ so->arrayKeys[numArrayKeys].cur_elem = 0;
+ so->arrayKeys[numArrayKeys].elem_values = NULL; /* unused */
+ so->arrayKeys[numArrayKeys].use_sksup = skipatts[attno_skip - 1].use_sksup;
+ so->arrayKeys[numArrayKeys].null_elem = true; /* for now */
+ so->arrayKeys[numArrayKeys].sksup = skipatts[attno_skip - 1].sksup;
+ so->arrayKeys[numArrayKeys].low_compare = NULL; /* for now */
+ so->arrayKeys[numArrayKeys].high_compare = NULL; /* for now */
+
+ /*
+ * Temporary testing GUC can disable the use of an opclass's skip
+ * support routine
+ */
+ if (!skipscan_skipsupport_enabled)
+ so->arrayKeys[numArrayKeys].use_sksup = false;
+
+ /*
+ * We'll need a 3-way ORDER proc to determine when and how the
+ * consed-up "array" will advance inside _bt_advance_array_keys.
+ * Set one up now.
+ */
+ _bt_setup_array_cmp(scan, cur, opcintype,
+ &so->orderProcs[output_ikey], NULL);
+
+ /*
+ * Prepare to output next scan key (might be another skip scan
+ * key, or it could be an input scan key from scan->keyData[])
+ */
+ numSkipArrayKeys--;
+ numArrayKeys++;
+ attno_skip++;
+ output_ikey++; /* keep this scan key/array */
+ }
/*
- * First, deconstruct the array into elements. Anything allocated
- * here (including a possibly detoasted array value) is in the
- * workspace context.
+ * Copy input scan key into temp arrayKeyData scan key array. (From
+ * here on, cur points at our copy of the input scan key.)
+ */
+ cur = &arrayKeyData[output_ikey];
+ *cur = scan->keyData[input_ikey];
+
+ if (!(cur->sk_flags & SK_SEARCHARRAY))
+ {
+ output_ikey++; /* keep this non-array scan key */
+ continue;
+ }
+
+ /*
+ * Deconstruct the array into elements
*/
arrayval = DatumGetArrayTypeP(cur->sk_argument);
/* We could cache this data, but not clear it's worth it */
@@ -406,6 +567,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
_bt_find_extreme_element(scan, cur, elemtype,
BTGreaterStrategyNumber,
elem_values, num_nonnulls);
+ output_ikey++; /* keep this transformed scan key */
continue;
case BTEqualStrategyNumber:
/* proceed with rest of loop */
@@ -416,6 +578,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
_bt_find_extreme_element(scan, cur, elemtype,
BTLessStrategyNumber,
elem_values, num_nonnulls);
+ output_ikey++; /* keep this transformed scan key */
continue;
default:
elog(ERROR, "unrecognized StrategyNumber: %d",
@@ -432,7 +595,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
* sortproc just points to the same proc used during binary searches.
*/
_bt_setup_array_cmp(scan, cur, elemtype,
- &so->orderProcs[i], &sortprocp);
+ &so->orderProcs[output_ikey], &sortprocp);
/*
* Sort the non-null elements and eliminate any duplicates. We must
@@ -476,11 +639,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
break;
}
- /*
- * Indicate to _bt_preprocess_keys caller that it must ignore
- * this scan key
- */
- cur->sk_strategy = InvalidStrategy;
+ /* Throw away this array */
continue;
}
@@ -511,12 +670,19 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
* Note: _bt_preprocess_array_keys_final will fix-up each array's
* scan_key field later on, after so->keyData[] has been finalized.
*/
- so->arrayKeys[numArrayKeys].scan_key = i;
+ so->arrayKeys[numArrayKeys].scan_key = output_ikey;
so->arrayKeys[numArrayKeys].num_elems = num_elems;
so->arrayKeys[numArrayKeys].elem_values = elem_values;
+ so->arrayKeys[numArrayKeys].null_elem = false; /* unused */
+ so->arrayKeys[numArrayKeys].use_sksup = false; /* redundant */
+ so->arrayKeys[numArrayKeys].low_compare = NULL; /* unused */
+ so->arrayKeys[numArrayKeys].high_compare = NULL; /* unused */
numArrayKeys++;
+ output_ikey++; /* keep this scan key/array */
}
+ /* Set final number of arrayKeyData[] keys, array keys */
+ *numberOfKeys = output_ikey;
so->numArrayKeys = numArrayKeys;
MemoryContextSwitchTo(oldContext);
@@ -624,7 +790,8 @@ _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap)
{
BTArrayKeyInfo *array = &so->arrayKeys[arrayidx];
- Assert(array->num_elems > 0);
+ Assert(array->num_elems > 0 || array->num_elems == -1);
+ Assert(array->num_elems != -1 || outkey->sk_flags & SK_BT_REQFWD);
if (array->scan_key == input_ikey)
{
@@ -685,6 +852,253 @@ _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap)
so->numArrayKeys, INDEX_MAX_KEYS)));
}
+/*
+ * _bt_decide_skipatts() -- set index attributes requiring skip arrays
+ *
+ * _bt_preprocess_array_keys helper function. Determines which attributes
+ * will require skip arrays/scan keys. Also sets up skip support callbacks
+ * for attributes whose input opclass has skip support (opclasses without
+ * skip support will fall back on using next-key sentinel values when
+ * advancing the skip array to its next array element).
+ *
+ * Return value is the total number of scan keys to add as "input" scan keys
+ * for further processing within _bt_preprocess_keys.
+ */
+static int
+_bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts)
+{
+ Relation rel = scan->indexRelation;
+ ScanKey inputsk;
+ AttrNumber attno_inputsk = 1,
+ attno_skip = 1;
+ bool attno_has_equal = false,
+ attno_has_rowcompare = false;
+ int numSkipArrayKeys = 0,
+ prev_numSkipArrayKeys = 0;
+
+ Assert(scan->numberOfKeys);
+
+ /*
+ * FIXME Don't support parallel index scans for now.
+ *
+ * _bt_parallel_primscan_schedule must be taught to account for skip
+ * arrays. This is likely to require that we store the current array
+ * element datum in shared memory.
+ */
+ if (scan->parallel_scan)
+ return 0;
+
+ /*
+ * Only add skip arrays (and associated scan keys) when doing so will
+ * enable _bt_preprocess_keys to mark one or more lower-order input scan
+ * keys (user-visible scan keys taken from scan->keyData[] input array) as
+ * required to continue the scan.
+ */
+ inputsk = &scan->keyData[0];
+ for (int i = 0;; inputsk++, i++)
+ {
+ /*
+ * Backfill skip arrays for any wholly omitted attributes prior to
+ * attno_inputsk
+ */
+ while (attno_skip < attno_inputsk)
+ {
+ if (!_bt_skipsupport(rel, attno_skip, &skipatts[attno_skip - 1]))
+ {
+ /*
+ * Input opclass lacks an equality operator (cannot skip).
+ *
+ * Return prev_numSkipArrayKeys, so as to avoid including any
+ * "backfilled" arrays that were supposed to form a contiguous
+ * group with a skip array on this attribute. There is no
+ * benefit to adding backfill skip arrays unless we can do so
+ * for all attributes (all attributes up to and including the
+ * one immediately before attno_inputsk).
+ */
+ return prev_numSkipArrayKeys;
+ }
+
+ /* plan on adding a backfill skip array for this attribute */
+ numSkipArrayKeys++;
+ attno_skip++;
+ }
+
+ /*
+ * Stop once past the final input scan key. We deliberately never add
+ * a skip attribute for the attribute of the last input scan key.
+ *
+ * If the last input scan key(s) use equality strategy, then a skip
+ * attribute is superfluous at best. If the last input scan key uses
+ * an inequality strategy, then adding a skip scan array/scan key is a
+ * valid though suboptimal transformation. It is better to arrange
+ * for preprocessing to allow such an input inequality scan key to
+ * remain an inequality on output. That way _bt_checkkeys will be
+ * able to make best use of both of its precheck optimizations, but
+ * _bt_first will be no less capable of efficiently finding the
+ * starting position for each primitive index scan.
+ */
+ if (i >= scan->numberOfKeys)
+ break;
+
+ /*
+ * Cannot keep adding skip arrays after a RowCompare
+ */
+ if (attno_has_rowcompare)
+ break;
+
+ /*
+ * Apply temporary testing GUC that can be used to disable skipping
+ * (either in part or in whole)
+ */
+ if (attno_inputsk > skipscan_prefix_cols)
+ break;
+
+ /*
+ * Now consider next attno_inputsk (or keep going if this is an
+ * additional scan key against the same attribute)
+ */
+ if (attno_inputsk < inputsk->sk_attno)
+ {
+ prev_numSkipArrayKeys = numSkipArrayKeys;
+
+ /*
+ * Now add skip array for previous scan key's attribute, though
+ * only if the attribute has no equality strategy scan keys.
+ *
+ * Adding skip arrays to an attribute that has one or more
+ * inequality scan keys will cause preprocessing to output a range
+ * skip array. This will happen when preprocessing proper deals
+ * with the redundancy between the array and its inequalities.
+ */
+ skipatts[attno_skip - 1].eq_op = InvalidOid;
+ if (!attno_has_equal)
+ {
+ /* Only saw inequalities for the prior attribute */
+ if (_bt_skipsupport(rel, attno_skip, &skipatts[attno_skip - 1]))
+ {
+ /* add a range skip array for this attribute */
+ numSkipArrayKeys++;
+ }
+ else
+ break;
+ }
+ else
+ {
+ /*
+ * Saw an equality for the prior attribute, so it doesn't need
+ * a skip array (not even a range skip array). We'll be able
+ * to add later skip arrays, too (doesn't matter if the prior
+ * attribute uses an input opclass without skip support).
+ */
+ }
+
+ /* Set things up for this new attribute */
+ attno_skip++;
+ attno_inputsk = inputsk->sk_attno;
+ attno_has_equal = false;
+ }
+
+ /*
+ * Track if this scan key's attribute has any equality strategy scan
+ * keys.
+ *
+ * Treat IS NULL scan keys as using equal strategy (they'll be marked
+ * as using it later on, by _bt_fix_scankey_strategy).
+ */
+ if (inputsk->sk_strategy == BTEqualStrategyNumber ||
+ (inputsk->sk_flags & SK_SEARCHNULL))
+ attno_has_equal = true;
+
+ /*
+ * We don't support RowCompare transformation. Remember that we saw a
+ * RowCompare, so that we don't keep adding skip attributes.
+ *
+ * We do still backfill skip attributes before the RowCompare, so that
+ * it can be marked required. This is similar to what happens when a
+ * conventional inequality uses an opclass that lacks skip support.
+ */
+ if (inputsk->sk_flags & SK_ROW_HEADER)
+ attno_has_rowcompare = true;
+ }
+
+ return numSkipArrayKeys;
+}
+
+/*
+ * _bt_skipsupport() -- set up skip support function in *skipatts
+ *
+ * Returns true on success, indicating that we set *skipatts with input
+ * opclass's equality operator. Otherwise returns false.
+ */
+static bool
+_bt_skipsupport(Relation rel, int add_skip_attno, BTSkipPreproc *skipatts)
+{
+ int16 *indoption = rel->rd_indoption;
+ Oid opfamily = rel->rd_opfamily[add_skip_attno - 1];
+ Oid opcintype = rel->rd_opcintype[add_skip_attno - 1];
+ bool reverse;
+
+ /* Look up input opclass's equality operator (might fail) */
+ skipatts->eq_op = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ /*
+ * We don't really expect input opclasses lacking even an equality
+ * operator, but they're still supported. Deal with them gracefully.
+ */
+ if (!OidIsValid(skipatts->eq_op))
+ return false;
+
+ /* Have skip support infrastructure set all SkipSupport fields */
+ reverse = (indoption[add_skip_attno - 1] & INDOPTION_DESC) != 0;
+ skipatts->use_sksup = PrepareSkipSupportFromOpclass(opfamily, opcintype,
+ reverse,
+ &skipatts->sksup);
+
+ /* might not have set up skip support routine, but can skip either way */
+ return true;
+}
+
+/*
+ * _bt_skipsupport_decrement() -- Get a decremented copy of skey's arg
+ *
+ * Sets *underflow for caller. Returns a valid decremented value (allocated
+ * in caller's memory context for pass-by-reference types) when *underflow is
+ * set to 'false'. Otherwise returns an undefined value that caller doesn't
+ * have to pfree.
+ */
+static inline Datum
+_bt_skipsupport_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ bool *underflow)
+{
+ Assert(array->use_sksup);
+
+ if (!(skey->sk_flags & SK_BT_DESC))
+ return array->sksup.decrement(rel, skey->sk_argument, underflow);
+ else
+ return array->sksup.increment(rel, skey->sk_argument, underflow);
+}
+
+/*
+ * _bt_skipsupport_increment() -- Get an incremented copy of skey's arg
+ *
+ * Sets *overflow for caller. Returns a valid incremented value (allocated in
+ * caller's memory context for pass-by-reference types) when *overflow is set
+ * to 'false'. Otherwise returns an undefined value that caller doesn't have
+ * to pfree.
+ */
+static inline Datum
+_bt_skipsupport_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ bool *overflow)
+{
+ Assert(array->use_sksup);
+
+ if (!(skey->sk_flags & SK_BT_DESC))
+ return array->sksup.increment(rel, skey->sk_argument, overflow);
+ else
+ return array->sksup.decrement(rel, skey->sk_argument, overflow);
+}
+
/*
* _bt_setup_array_cmp() -- Set up array comparison functions
*
@@ -977,17 +1391,15 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
FmgrInfo *orderproc, BTArrayKeyInfo *array,
bool *qual_ok)
{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
Oid opcintype = rel->rd_opcintype[arraysk->sk_attno - 1];
- int cmpresult = 0,
- cmpexact = 0,
- matchelem,
- new_nelems = 0;
FmgrInfo crosstypeproc;
FmgrInfo *orderprocp = orderproc;
+ MemoryContext oldContext;
+ bool eliminated;
Assert(arraysk->sk_attno == skey->sk_attno);
- Assert(array->num_elems > 0);
Assert(!(arraysk->sk_flags & (SK_ISNULL | SK_ROW_HEADER | SK_ROW_MEMBER)));
Assert((arraysk->sk_flags & SK_SEARCHARRAY) &&
arraysk->sk_strategy == BTEqualStrategyNumber);
@@ -1000,8 +1412,8 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
* datum of opclass input type for the index's attribute (on-disk type).
* We can reuse the array's ORDER proc whenever the non-array scan key's
* type is a match for the corresponding attribute's input opclass type.
- * Otherwise, we have to do another ORDER proc lookup so that our call to
- * _bt_binsrch_array_skey applies the correct comparator.
+ * Otherwise, we have to do another ORDER proc lookup. We have to be sure
+ * that _bt_compare_array_skey/_bt_binsrch_array_skey use the right proc.
*
* Note: we have to support the convention that sk_subtype == InvalidOid
* means the opclass input type; this is a hack to simplify life for
@@ -1032,11 +1444,65 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
return false;
}
- /* We have all we need to determine redundancy/contradictoriness */
+ /* We successfully looked up the required cross-type ORDER proc */
orderprocp = &crosstypeproc;
fmgr_info(cmp_proc, orderprocp);
}
+ oldContext = MemoryContextSwitchTo(so->arrayContext);
+
+ /*
+ * Perform preprocessing of the array based on whether it's a conventional
+ * array, or a skip array. Sets *qual_ok correctly in passing.
+ */
+ if (array->num_elems != -1)
+ {
+ _bt_array_preproc_shrink(arraysk, skey, orderprocp, array, qual_ok);
+
+ /*
+ * We successfully looked up the required cross-type ORDER proc, which
+ * ensured that the scalar scan key could be eliminated as redundant
+ */
+ eliminated = true;
+ }
+ else
+ {
+ /*
+ * With a skip array it's possible that we won't be able to eliminate
+ * the scalar scan key, despite looking up the required ORDER proc.
+ * This happens when earlier preprocessing wasn't able to eliminate a
+ * redundant scan key inequality due to a lack of cross-type support.
+ */
+ eliminated = _bt_skip_preproc_shrink(scan, arraysk, skey, orderprocp,
+ array, qual_ok);
+ }
+
+ MemoryContextSwitchTo(oldContext);
+
+ return eliminated;
+}
+
+/*
+ * Finish off preprocessing of conventional (non-skip) array scan key when it
+ * is redundant with (or contradicted by) a non-array scalar scan key.
+ * _bt_compare_array_scankey_args helper function, called after the relevant
+ * (potentially cross-type) ORDER proc has been looked up successfully.
+ *
+ * Rewrites caller's array in-place as needed to eliminate redundant array
+ * elements. Calling here always renders caller's scalar scan key redundant.
+ */
+static void
+_bt_array_preproc_shrink(ScanKey arraysk, ScanKey skey, FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok)
+{
+ int cmpresult = 0,
+ cmpexact = 0,
+ matchelem,
+ new_nelems = 0;
+
+ Assert(array->num_elems > 0);
+ Assert(!(arraysk->sk_flags & SK_BT_SKIP));
+
matchelem = _bt_binsrch_array_skey(orderprocp, false,
NoMovementScanDirection,
skey->sk_argument, false, array,
@@ -1088,6 +1554,137 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
array->num_elems = new_nelems;
*qual_ok = new_nelems > 0;
+}
+
+/*
+ * Finish off preprocessing of skip array scan key when it is "redundant with"
+ * a non-array scalar scan key. The scalar scan key must be an inequality.
+ * _bt_compare_array_scankey_args helper function, called after the relevant
+ * (potentially cross-type) ORDER proc has been looked up successfully.
+ *
+ * Unlike _bt_array_preproc_shrink, we cannot really modify caller's array
+ * in-place. Skip arrays work by procedurally generating their elements as
+ * needed, so our approach is to store a copy of the inequality in the skip
+ * array, allowing its elements to be generated within the limits of a range.
+ * Calling here usually renders caller's scalar scan key redundant (the key is
+ * applied when the array advances, but that's just an implementation detail).
+ *
+ * Return value indicates whether caller's scalar scan key was made redundant.
+ * We return true in the common case where caller's scan key could be
+ * successfully rolled into the skip array (or discarded as redundant with an
+ * existing bound). We return false when the array already has a bound of the
+ * same kind and the two inequalities cannot be compared.
+ */
+static bool
+_bt_skip_preproc_shrink(IndexScanDesc scan, ScanKey arraysk, ScanKey skey,
+ FmgrInfo *orderprocp, BTArrayKeyInfo *array,
+ bool *qual_ok)
+{
+ bool test_result;
+
+ /*
+ * We don't expect to have to deal with NULLs in non-array/non-skip scan
+ * key. We expect _bt_preprocess_array_keys to avoid generating a skip
+ * array for an index attribute with an IS NULL input scan key. (It will
+ * still do so in the presence of IS NOT NULL input scan keys, but
+ * _bt_compare_scankey_args is expected to handle those for us.)
+ */
+ Assert(arraysk->sk_flags & SK_BT_SKIP);
+ Assert(arraysk->sk_flags & SK_SEARCHARRAY);
+ Assert(arraysk->sk_strategy == BTEqualStrategyNumber);
+ Assert(array->num_elems == -1);
+
+ /* Scalar scan key must be a B-Tree inequality (these are always strict) */
+ Assert(!(skey->sk_flags & SK_ISNULL));
+ Assert(skey->sk_strategy != BTEqualStrategyNumber);
+
+ /*
+ * Array must not generate a NULL array element (for "IS NULL" qual). Its
+ * index attribute is constrained by a strict operator, so NULL elements
+ * must not be returned by the scan (it would be wrong to allow it).
+ */
+ array->null_elem = false;
+ *qual_ok = true;
+
+ /*
+ * Store a copy of caller's scalar scan key, plus a copy of the operator's
+ * corresponding 3-way ORDER proc.
+ *
+ * A skip array scan key always uses the underlying index attribute's
+ * input opclass, but it's possible that caller's scalar scan key uses a
+ * cross-type operator. In cross-type scenarios, skey.sk_argument doesn't
+ * use the same type as later array elements (which are all just copies of
+ * datums taken from index tuples, possibly modified by skip support).
+ *
+ * We represent the lowest (and highest) possible value in the array using
+ * the sentinel value -inf (+inf for high_compare). The only exceptions
+ * apply when the opclass has skip support: there we can use a copy of the
+ * skip support routine's low_elem/high_elem instead -- though only when
+ * there is no corresponding low_compare/high_compare inequality.
+ *
+ * _bt_first understands that -inf/+inf indicate that it should use the
+ * low_compare/high_compare inequality for initial positioning purposes
+ * when it sees either value (unless there is no corresponding inequality,
+ * in which case the values are literally interpreted as -inf or +inf).
+ * _bt_first can therefore vary in whether it uses a cross-type operator,
+ * or an input-opclass-only operator (it can vary across primitive scans
+ * for the same index attribute/skip array).
+ *
+ * _bt_scankey_decrement/_bt_scankey_increment both make sure that each
+ * newly generated element is constrained by low_compare/high_compare.
+ * This must happen without skey.sk_argument ever being treated as a true
+ * array element (that wouldn't always work because array elements are
+ * only ever supposed to use the opclass input type).
+ */
+ switch (skey->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+ case BTLessEqualStrategyNumber:
+ if (array->high_compare)
+ {
+ /* try to keep only one high_compare inequality */
+ if (!_bt_compare_scankey_args(scan, array->high_compare, skey,
+ array->high_compare, NULL, NULL,
+ &test_result))
+ return false; /* can't make new high_compare redundant */
+
+ if (!test_result)
+ return true; /* discard new high_compare */
+
+ /* replace old high_compare with new one */
+ }
+ else
+ array->high_compare = palloc(sizeof(ScanKeyData));
+
+ memcpy(array->high_compare, skey, sizeof(ScanKeyData));
+ array->order_high = *orderprocp;
+ break;
+ case BTGreaterEqualStrategyNumber:
+ case BTGreaterStrategyNumber:
+ if (array->low_compare)
+ {
+ /* try to keep only one low_compare inequality */
+ if (!_bt_compare_scankey_args(scan, array->low_compare, skey,
+ array->low_compare, NULL, NULL,
+ &test_result))
+ return false; /* can't make new low_compare redundant */
+
+ if (!test_result)
+ return true; /* discard new low_compare */
+
+ /* replace old low_compare with new one */
+ }
+ else
+ array->low_compare = palloc(sizeof(ScanKeyData));
+
+ memcpy(array->low_compare, skey, sizeof(ScanKeyData));
+ array->order_low = *orderprocp;
+ break;
+ default:
+ elog(ERROR, "unrecognized StrategyNumber: %d",
+ (int) skey->sk_strategy);
+ break;
+ }
return true;
}
@@ -1130,7 +1727,8 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
static inline int32
_bt_compare_array_skey(FmgrInfo *orderproc,
Datum tupdatum, bool tupnull,
- Datum arrdatum, ScanKey cur)
+ Datum arrdatum, bool arrnull,
+ ScanKey cur)
{
int32 result = 0;
@@ -1138,14 +1736,14 @@ _bt_compare_array_skey(FmgrInfo *orderproc,
if (tupnull) /* NULL tupdatum */
{
- if (cur->sk_flags & SK_ISNULL)
+ if (arrnull)
result = 0; /* NULL "=" NULL */
else if (cur->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (cur->sk_flags & SK_ISNULL) /* NOT_NULL tupdatum, NULL arrdatum */
+ else if (arrnull) /* NOT_NULL tupdatum, NULL arrdatum */
{
if (cur->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
@@ -1211,6 +1809,8 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
Datum arrdatum;
Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(!(cur->sk_flags & SK_BT_SKIP));
+ Assert(!(cur->sk_flags & SK_ISNULL)); /* plain arrays can't do this */
Assert(cur->sk_strategy == BTEqualStrategyNumber);
if (cur_elem_trig)
@@ -1246,7 +1846,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
{
arrdatum = array->elem_values[low_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result <= 0)
{
@@ -1274,7 +1874,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
{
arrdatum = array->elem_values[high_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result >= 0)
{
@@ -1301,7 +1901,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
arrdatum = array->elem_values[mid_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result == 0)
{
@@ -1326,13 +1926,70 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
*/
if (low_elem != mid_elem)
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- array->elem_values[low_elem], cur);
+ array->elem_values[low_elem], false,
+ cur);
*set_elem_result = result;
return low_elem;
}
+/*
+ * _bt_binsrch_skiparray_skey() -- "Binary search" within a skip array
+ *
+ * This routine doesn't return an index into the array, because the array
+ * doesn't actually have any elements (it generates its array elements
+ * procedurally instead). Note that the generated elements may include a NULL
+ * value (for an IS NULL qual).
+ *
+ * Sets *set_elem_result just like _bt_binsrch_array_skey would with a true
+ * array. The value 0 indicates that tupdatum/tupnull is within the range of
+ * the skip array. Other values indicate what _bt_compare_array_skey returned
+ * for the best available match to tupdatum/tupnull (in practice this means
+ * either the lowest item or the highest item in the range of the array).
+ */
+static void
+_bt_binsrch_skiparray_skey(FmgrInfo *orderproc, Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *set_elem_result)
+{
+ Assert(cur->sk_flags & SK_BT_SKIP);
+ Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(cur->sk_flags & SK_BT_REQFWD);
+ Assert(array->num_elems == -1);
+
+ if (tupnull) /* NULL tupdatum */
+ {
+ if (array->null_elem)
+ *set_elem_result = 0; /* NULL "=" NULL */
+ else if (cur->sk_flags & SK_BT_NULLS_FIRST)
+ *set_elem_result = -1; /* NULL "<" NOT_NULL */
+ else
+ *set_elem_result = 1; /* NULL ">" NOT_NULL */
+
+ return;
+ }
+
+ /*
+ * Array inequalities determine whether tupdatum is within the range of
+ * caller's skip array
+ */
+ if (array->low_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->low_compare->sk_func,
+ array->low_compare->sk_collation,
+ tupdatum,
+ array->low_compare->sk_argument)))
+ *set_elem_result = -1;
+ else if (array->high_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->high_compare->sk_func,
+ array->high_compare->sk_collation,
+ tupdatum,
+ array->high_compare->sk_argument)))
+ *set_elem_result = 1;
+ else
+ *set_elem_result = 0;
+}
+
/*
* _bt_start_array_keys() -- Initialize array keys at start of a scan
*
@@ -1342,29 +1999,498 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
void
_bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- int i;
Assert(so->numArrayKeys);
Assert(so->qual_ok);
- for (i = 0; i < so->numArrayKeys; i++)
+ for (int i = 0; i < so->numArrayKeys; i++)
{
BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
ScanKey skey = &so->keyData[curArrayKey->scan_key];
- Assert(curArrayKey->num_elems > 0);
Assert(skey->sk_flags & SK_SEARCHARRAY);
- if (ScanDirectionIsBackward(dir))
- curArrayKey->cur_elem = curArrayKey->num_elems - 1;
- else
- curArrayKey->cur_elem = 0;
- skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
+ _bt_scankey_set_low_or_high(rel, skey, curArrayKey,
+ ScanDirectionIsForward(dir));
}
so->scanBehind = false;
}
+/*
+ * _bt_scankey_set_low_or_high() -- Set array scan key to lowest/highest element
+ *
+ * Caller also passes associated scan key, which will have its argument set to
+ * the lowest/highest array value in passing.
+ */
+static void
+_bt_scankey_set_low_or_high(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ bool low_not_high)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+
+ if (array->num_elems != -1)
+ {
+ /* set low or high element for conventional array */
+ int set_elem = 0;
+
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+
+ if (!low_not_high)
+ set_elem = array->num_elems - 1;
+
+ /*
+ * Just copy over array datum (only skip arrays require freeing and
+ * allocating memory for sk_argument)
+ */
+ array->cur_elem = set_elem;
+ skey->sk_argument = array->elem_values[set_elem];
+
+ return;
+ }
+
+ /* set low or high element for skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(array->num_elems == -1);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Clear possibly-irrelevant flags */
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEG_INF | SK_BT_POS_INF |
+ SK_BT_NEXTKEY | SK_BT_PREVKEY);
+
+ if (array->null_elem &&
+ (low_not_high == ((skey->sk_flags & SK_BT_NULLS_FIRST) != 0)))
+ {
+ /* Lowest (or highest) element is NULL, so set scan key to NULL */
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+ }
+ else if (low_not_high)
+ {
+ /* Lowest array element isn't NULL */
+ if (array->use_sksup && !array->low_compare)
+ skey->sk_argument = datumCopy(array->sksup.low_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_flags |= SK_BT_NEG_INF;
+ }
+ else
+ {
+ /* Highest array element isn't NULL */
+ if (array->use_sksup && !array->high_compare)
+ skey->sk_argument = datumCopy(array->sksup.high_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_flags |= SK_BT_POS_INF;
+ }
+}
+
+/*
+ * _bt_scankey_set_element() -- Set skip array scan key's sk_argument
+ *
+ * Sets scan key to "IS NULL" when required, and handles memory management for
+ * pass-by-reference types.
+ */
+static void
+_bt_scankey_set_element(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ Datum tupdatum, bool tupnull)
+{
+ /* tupdatum is within the range of low_compare/high_compare */
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(tupnull && !array->null_elem));
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEG_INF | SK_BT_POS_INF |
+ SK_BT_NEXTKEY | SK_BT_PREVKEY);
+
+ /*
+ * Treat tupdatum/tupnull as a matching array element.
+ *
+ * We just copy tupdatum into the array's scan key (there is no
+ * conventional array element for us to set, of course).
+ *
+ * Unlike standard arrays, skip arrays sometimes need to locate NULLs.
+ * Treat them as just another value from the domain of indexed values.
+ */
+ if (!tupnull)
+ skey->sk_argument = datumCopy(tupdatum, attr->attbyval, attr->attlen);
+ else
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+}
+
+/*
+ * _bt_scankey_unset_isnull() -- increment/decrement scan key from NULL
+ *
+ * Unsets scan key's "IS NULL" marking, and sets the non-NULL value from the
+ * array immediately before (or immediately after) NULL in the key space.
+ */
+static void
+_bt_scankey_unset_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(skey->sk_flags & SK_SEARCHNULL);
+ Assert(skey->sk_flags & SK_ISNULL);
+ Assert(!(skey->sk_flags & (SK_BT_NEG_INF | SK_BT_POS_INF |
+ SK_BT_NEXTKEY | SK_BT_PREVKEY)));
+ Assert(skey->sk_argument == 0);
+ Assert(array->use_sksup && array->null_elem &&
+ !array->low_compare && !array->high_compare);
+
+ /*
+ * sk_argument must be set to whatever non-NULL value comes immediately
+ * before or after NULL
+ */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEG_INF | SK_BT_POS_INF |
+ SK_BT_NEXTKEY | SK_BT_PREVKEY);
+ if (skey->sk_flags & SK_BT_NULLS_FIRST)
+ skey->sk_argument = datumCopy(array->sksup.low_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_argument = datumCopy(array->sksup.high_elem,
+ attr->attbyval, attr->attlen);
+}
+
+/*
+ * _bt_scankey_set_isnull() -- decrement/increment scan key to NULL
+ */
+static void
+_bt_scankey_set_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & (SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEG_INF | SK_BT_POS_INF |
+ SK_BT_NEXTKEY | SK_BT_PREVKEY)));
+ Assert(array->null_elem);
+ Assert(!array->low_compare && !array->high_compare);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Set sk_argument to NULL */
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+}
+
+/*
+ * _bt_scankey_decrement() -- decrement array scan key's sk_argument
+ *
+ * Return value indicates whether caller's array was successfully decremented.
+ * Cannot decrement an array whose current element is already the first one.
+ */
+static bool
+_bt_scankey_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ bool underflow = false;
+ Datum dec_sk_argument;
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & (SK_BT_POS_INF | SK_BT_NEXTKEY | SK_BT_PREVKEY)));
+
+ /* Regular (non-skip) array? */
+ if (array->num_elems != -1)
+ {
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+ if (array->cur_elem > 0)
+ {
+ /*
+ * Just copy over array datum (only skip arrays require freeing
+ * and allocating memory for sk_argument)
+ */
+ array->cur_elem--;
+ skey->sk_argument = array->elem_values[array->cur_elem];
+
+ /* Successfully decremented array */
+ return true;
+ }
+
+ /* Cannot decrement to before first array element */
+ return false;
+ }
+
+ /* Nope, this is a skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+
+ /* The sentinel value -inf is never decrementable */
+ if (skey->sk_flags & SK_BT_NEG_INF)
+ return false;
+
+ /*
+ * When the current array element is NULL, and the lowest sorting value in
+ * the index is also NULL, we cannot decrement before first array element
+ */
+ if ((skey->sk_flags & SK_ISNULL) && (skey->sk_flags & SK_BT_NULLS_FIRST))
+ return false;
+
+ /*
+ * Opclasses without skip support "decrement" the scan key's current
+ * element by setting the PREVKEY flag. The true previous value can only
+ * be determined when the scan reads lower sorting tuples.
+ */
+ if (!array->use_sksup)
+ {
+ /*
+ * Determine as best we can (given the lack of skip support) whether
+ * the previous element will turn out to be out of bounds for the skip
+ * array.
+ *
+ * Skip arrays (that lack skip support) can only do this when their
+ * low_compare is for an >= inequality; if the current array element
+ * is == the inequality's sk_argument, then the true previous value
+ * cannot possibly satisfy low_compare. We can give up right away.
+ */
+ if (array->low_compare &&
+ array->low_compare->sk_strategy == BTGreaterEqualStrategyNumber &&
+ _bt_compare_array_skey(&array->order_low,
+ array->low_compare->sk_argument, false,
+ skey->sk_argument, false,
+ skey) == 0)
+ return false;
+
+ /* else the scan must figure out the true previous value */
+ skey->sk_flags |= SK_BT_PREVKEY;
+ return true;
+ }
+
+ /*
+ * Opclasses with skip support decrement the scan key's current element
+ * using a callback
+ */
+ if (skey->sk_flags & SK_ISNULL)
+ {
+ /*
+ * Existing sk_argument/array element is NULL (for an IS NULL qual).
+ *
+ * "Decrement" current array element to the high_elem value provided
+ * by opclass skip support routine.
+ */
+ _bt_scankey_unset_isnull(rel, skey, array);
+ return true;
+ }
+
+ /*
+ * Ask opclass support routine to provide decremented copy of existing
+ * non-NULL sk_argument
+ */
+ dec_sk_argument = _bt_skipsupport_decrement(rel, skey, array, &underflow);
+
+ if (underflow)
+ {
+ if (array->null_elem && (skey->sk_flags & SK_BT_NULLS_FIRST))
+ {
+ /*
+ * Existing sk_argument was already equal to non-NULL low_elem
+ * provided by opclass skip support routine, but skip array's true
+ * lowest element is actually NULL.
+ *
+ * "Decrement" sk_argument to NULL.
+ */
+ _bt_scankey_set_isnull(rel, skey, array);
+ return true;
+ }
+
+ /* Cannot decrement before first array element */
+ return false;
+ }
+
+ /*
+ * Make sure that the decremented value is within the range of the skip
+ * array
+ */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (array->low_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->low_compare->sk_func,
+ array->low_compare->sk_collation,
+ dec_sk_argument,
+ array->low_compare->sk_argument)))
+ {
+ /* decremented value is out of bounds for range skip array */
+ if (!attr->attbyval)
+ pfree(DatumGetPointer(dec_sk_argument));
+ return false;
+ }
+
+ /* Accept non-NULL datum value from opclass decrement callback */
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+ skey->sk_argument = dec_sk_argument;
+
+ return true;
+}
+
+/*
+ * _bt_scankey_increment() -- increment array scan key's sk_argument
+ *
+ * Return value indicates whether caller's array was successfully incremented.
+ * Cannot increment an array whose current element is already the final one.
+ */
+static bool
+_bt_scankey_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ bool overflow = false;
+ Datum inc_sk_argument;
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & (SK_BT_NEG_INF | SK_BT_NEXTKEY | SK_BT_PREVKEY)));
+
+ /* Regular (non-skip) array? */
+ if (array->num_elems != -1)
+ {
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+ if (array->cur_elem < array->num_elems - 1)
+ {
+ /*
+ * Just copy over array datum (only skip arrays require freeing
+ * and allocating memory for sk_argument)
+ */
+ array->cur_elem++;
+ skey->sk_argument = array->elem_values[array->cur_elem];
+
+ /* Successfully incremented array */
+ return true;
+ }
+
+ /* Cannot increment past final array element */
+ return false;
+ }
+
+ /* Nope, this is a skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+
+ /* The sentinel value +inf is never incrementable */
+ if (skey->sk_flags & SK_BT_POS_INF)
+ return false;
+
+ /*
+ * When the current array element is NULL, and the highest sorting value
+ * in the index is also NULL, we cannot increment past the final element
+ */
+ if ((skey->sk_flags & SK_ISNULL) && !(skey->sk_flags & SK_BT_NULLS_FIRST))
+ return false;
+
+ /*
+ * Opclasses without skip support "increment" the scan key's current
+ * element by setting the NEXTKEY flag. The true next value can only
+ * be determined when the scan reads higher sorting tuples.
+ */
+ if (!array->use_sksup)
+ {
+ /*
+ * Determine as best we can (given the lack of skip support) whether
+ * the next element will turn out to be out of bounds for the skip
+ * array.
+ *
+ * Skip arrays (that lack skip support) can only do this when their
+ * high_compare is for an <= inequality; if the current array element
+ * is == the inequality's sk_argument, then the true next value cannot
+ * possibly satisfy high_compare. We can give up right away.
+ */
+ if (array->high_compare &&
+ array->high_compare->sk_strategy == BTLessEqualStrategyNumber &&
+ _bt_compare_array_skey(&array->order_high,
+ array->high_compare->sk_argument, false,
+ skey->sk_argument, false,
+ skey) == 0)
+ return false;
+
+ /* else the scan must figure out the true next value */
+ skey->sk_flags |= SK_BT_NEXTKEY;
+ return true;
+ }
+
+ /*
+ * Opclasses with skip support increment the scan key's current element
+ * using a callback
+ */
+ if (skey->sk_flags & SK_ISNULL)
+ {
+ /*
+ * Existing sk_argument/array element is NULL (for an IS NULL qual).
+ *
+ * "Increment" current array element to the low_elem value provided by
+ * opclass skip support routine.
+ */
+ _bt_scankey_unset_isnull(rel, skey, array);
+ return true;
+ }
+
+ /*
+ * Ask opclass support routine to provide incremented copy of existing
+ * non-NULL sk_argument
+ */
+ inc_sk_argument = _bt_skipsupport_increment(rel, skey, array, &overflow);
+
+ if (overflow)
+ {
+ if (array->null_elem && !(skey->sk_flags & SK_BT_NULLS_FIRST))
+ {
+ /*
+ * Existing sk_argument was already equal to non-NULL high_elem
+ * provided by opclass skip support routine, but skip array's true
+ * highest element is actually NULL.
+ *
+ * "Increment" sk_argument to NULL.
+ */
+ _bt_scankey_set_isnull(rel, skey, array);
+ return true;
+ }
+
+ /* Cannot increment past final array element */
+ return false;
+ }
+
+ /*
+ * Make sure that the incremented value is within the range of the skip
+ * array
+ */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (array->high_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->high_compare->sk_func,
+ array->high_compare->sk_collation,
+ inc_sk_argument,
+ array->high_compare->sk_argument)))
+ {
+ /* incremented value is out of bounds for range skip array */
+ if (!attr->attbyval)
+ pfree(DatumGetPointer(inc_sk_argument));
+ return false;
+ }
+
+ /* Accept non-NULL datum value from opclass increment callback */
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+ skey->sk_argument = inc_sk_argument;
+
+ return true;
+}
+
/*
* _bt_advance_array_keys_increment() -- Advance to next set of array elements
*
@@ -1380,6 +2506,7 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
static bool
_bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
/*
@@ -1389,29 +2516,30 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
*/
for (int i = so->numArrayKeys - 1; i >= 0; i--)
{
- BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
- ScanKey skey = &so->keyData[curArrayKey->scan_key];
- int cur_elem = curArrayKey->cur_elem;
- int num_elems = curArrayKey->num_elems;
- bool rolled = false;
+ BTArrayKeyInfo *array = &so->arrayKeys[i];
+ ScanKey skey = &so->keyData[array->scan_key];
- if (ScanDirectionIsForward(dir) && ++cur_elem >= num_elems)
+ if (ScanDirectionIsForward(dir))
{
- cur_elem = 0;
- rolled = true;
+ if (_bt_scankey_increment(rel, skey, array))
+ return true;
}
- else if (ScanDirectionIsBackward(dir) && --cur_elem < 0)
+ else
{
- cur_elem = num_elems - 1;
- rolled = true;
+ if (_bt_scankey_decrement(rel, skey, array))
+ return true;
}
- curArrayKey->cur_elem = cur_elem;
- skey->sk_argument = curArrayKey->elem_values[cur_elem];
- if (!rolled)
- return true;
+ /*
+ * Handle array roll over.
+ *
+ * Start over at the array's lowest sorting value (or its highest
+ * value, for backward scans)...
+ */
+ _bt_scankey_set_low_or_high(rel, skey, array,
+ ScanDirectionIsForward(dir));
- /* Need to advance next array key, if any */
+ /* ...then advance next most significant array, if any */
}
/*
@@ -1466,6 +2594,7 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
static void
_bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
int arrayidx = 0;
@@ -1473,7 +2602,6 @@ _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
{
ScanKey cur = so->keyData + ikey;
BTArrayKeyInfo *array = NULL;
- int first_elem_dir;
if (!(cur->sk_flags & SK_SEARCHARRAY) ||
cur->sk_strategy != BTEqualStrategyNumber)
@@ -1485,16 +2613,10 @@ _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
if ((cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
continue;
- if (ScanDirectionIsForward(dir))
- first_elem_dir = 0;
- else
- first_elem_dir = array->num_elems - 1;
+ Assert(array->num_elems != -1); /* No skipping of non-required arrays */
- if (array->cur_elem != first_elem_dir)
- {
- array->cur_elem = first_elem_dir;
- cur->sk_argument = array->elem_values[first_elem_dir];
- }
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
}
}
@@ -1539,11 +2661,22 @@ _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
* the page to the right of caller's finaltup/high key tuple instead). It's
* only possible that we'll set *scanBehind to true when caller passes us a
* pivot tuple (with truncated -inf attributes) that we return false for.
+ *
+ * When a skip array sets its scan key to -inf (or to +inf in the case of a
+ * backwards scan), the tuple will never be before the scan's current array
+ * keys on the basis of that particular scan key/tuple attribute value.
+ * However, some callers (infbefore callers) need us to resolve such a
+ * comparison by treating the -inf/+inf value as coming before every other
+ * value instead (before relative to the current scan direction). This scheme
+ * allows _bt_advance_array_keys to schedule the next primitive index scan
+ * when the page's finaltup has no values within the range of a range skip
+ * array, iff no earlier scan key triggered the next primitive scan first.
*/
static bool
_bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
IndexTuple tuple, TupleDesc tupdesc, int tupnatts,
- bool readpagetup, int sktrig, bool *scanBehind)
+ bool readpagetup, int sktrig, bool *scanBehind,
+ bool infbefore)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
@@ -1558,6 +2691,8 @@ _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
for (int ikey = sktrig; ikey < so->numberOfKeys; ikey++)
{
ScanKey cur = so->keyData + ikey;
+ Datum sk_argument = cur->sk_argument;
+ bool sk_isnull = (cur->sk_flags & SK_ISNULL) != 0;
Datum tupdatum;
bool tupnull;
int32 result;
@@ -1617,11 +2752,27 @@ _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
continue;
}
+ /*
+ * When scan key is marked NEG_INF, the current array element is lower
+ * than any possible indexable value (or it's lower than any possible
+ * value that satisfies the array's low_compare > or >= inequality).
+ *
+ * Similarly, when scan key is marked POS_INF, the current element is
+ * higher than any possible indexable value (or it's higher than any
+ * value satisfying the array's high_compare < or <= inequality).
+ */
+ if (cur->sk_flags & (SK_BT_NEG_INF | SK_BT_POS_INF))
+ {
+ Assert(cur->sk_flags & SK_BT_SKIP);
+ Assert(cur->sk_argument == 0);
+ return infbefore;
+ }
+
tupdatum = index_getattr(tuple, cur->sk_attno, tupdesc, &tupnull);
result = _bt_compare_array_skey(&so->orderProcs[ikey],
tupdatum, tupnull,
- cur->sk_argument, cur);
+ sk_argument, sk_isnull, cur);
/*
* Does this comparison indicate that caller must _not_ advance the
@@ -1631,6 +2782,19 @@ _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
(ScanDirectionIsBackward(dir) && result > 0))
return true;
+ /*
+ * When scan key is marked NEXTKEY, the current array element is
+ * "sk_argument + infinitesimal" (with PREVKEY the current element is
+ * "sk_argument - infinitesimal" instead). In other words, its value
+ * comes immediately after (or immediately before) sk_argument in the
+ * key space.
+ */
+ if ((cur->sk_flags & (SK_BT_NEXTKEY | SK_BT_PREVKEY)) && result == 0)
+ {
+ Assert(cur->sk_flags & SK_BT_SKIP);
+ return true;
+ }
+
/*
* Does this comparison indicate that caller should now advance the
* scan's arrays? (Must be if we get here during a readpagetup call.)
@@ -1806,7 +2970,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
* Precondition array state assertion
*/
Assert(!_bt_tuple_before_array_skeys(scan, dir, tuple, tupdesc,
- tupnatts, false, 0, NULL));
+ tupnatts, false, 0, NULL, false));
so->scanBehind = false; /* reset */
@@ -1954,18 +3118,9 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
if (beyond_end_advance)
{
- int final_elem_dir;
-
- if (ScanDirectionIsBackward(dir) || !array)
- final_elem_dir = 0;
- else
- final_elem_dir = array->num_elems - 1;
-
- if (array && array->cur_elem != final_elem_dir)
- {
- array->cur_elem = final_elem_dir;
- cur->sk_argument = array->elem_values[final_elem_dir];
- }
+ if (array)
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsBackward(dir));
continue;
}
@@ -1990,18 +3145,9 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
if (!all_required_satisfied || cur->sk_attno > tupnatts)
{
- int first_elem_dir;
-
- if (ScanDirectionIsForward(dir) || !array)
- first_elem_dir = 0;
- else
- first_elem_dir = array->num_elems - 1;
-
- if (array && array->cur_elem != first_elem_dir)
- {
- array->cur_elem = first_elem_dir;
- cur->sk_argument = array->elem_values[first_elem_dir];
- }
+ if (array)
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
continue;
}
@@ -2019,15 +3165,26 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
/*
* Binary search for closest match that's available from the array
*/
- set_elem = _bt_binsrch_array_skey(&so->orderProcs[ikey],
- cur_elem_trig, dir,
- tupdatum, tupnull, array, cur,
- &result);
+ if (array->num_elems != -1)
+ set_elem = _bt_binsrch_array_skey(&so->orderProcs[ikey],
+ cur_elem_trig, dir,
+ tupdatum, tupnull, array, cur,
+ &result);
- Assert(set_elem >= 0 && set_elem < array->num_elems);
+ /*
+ * Skip array. "Binary search" by checking if tupdatum/tupnull
+ * are within the low_value/high_value range of the skip array.
+ */
+ else
+ _bt_binsrch_skiparray_skey(&so->orderProcs[ikey],
+ tupdatum, tupnull, array, cur,
+ &result);
}
else
{
+ Datum sk_argument = cur->sk_argument;
+ bool sk_isnull = (cur->sk_flags & SK_ISNULL) != 0;
+
Assert(sktrig_required && required);
/*
@@ -2041,7 +3198,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
result = _bt_compare_array_skey(&so->orderProcs[ikey],
tupdatum, tupnull,
- cur->sk_argument, cur);
+ sk_argument, sk_isnull, cur);
}
/*
@@ -2100,11 +3257,62 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
}
}
- /* Advance array keys, even when set_elem isn't an exact match */
- if (array && array->cur_elem != set_elem)
+ /* Advance array keys, even when we don't have an exact match */
+
+ if (!array)
+ continue; /* no element to set in non-array */
+
+ /* Conventional arrays have a valid set_elem for us to advance to */
+ if (array->num_elems != -1)
{
- array->cur_elem = set_elem;
- cur->sk_argument = array->elem_values[set_elem];
+ if (array->cur_elem != set_elem)
+ {
+ array->cur_elem = set_elem;
+ cur->sk_argument = array->elem_values[set_elem];
+ }
+
+ continue;
+ }
+
+ /*
+ * Conceptually, skip arrays also have array elements. The actual
+ * elements/values are generated procedurally and on demand.
+ */
+ Assert(cur->sk_flags & SK_BT_SKIP);
+ Assert(array->num_elems == -1);
+ Assert(required);
+
+ if (result == 0)
+ {
+ /*
+ * Anything within the range of possible element values is treated
+ * as "a match for one of the array's elements". Store the next
+ * scan key argument value by taking a copy of the tupdatum value
+ * from caller's tuple (or set scan key IS NULL when tupnull, iff
+ * the array's range of possible elements covers NULL).
+ */
+ _bt_scankey_set_element(rel, cur, array, tupdatum, tupnull);
+ }
+ else if (beyond_end_advance)
+ {
+ /*
+ * We need to set the array element to the final "element" in the
+ * current scan direction for "beyond end of array element" array
+ * advancement. See above for an explanation.
+ */
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsBackward(dir));
+ }
+ else
+ {
+ /*
+ * The closest matching element is the lowest element; even that
+ * still puts us ahead of caller's tuple in the key space. This
+ * process has to carry to any lower-order arrays. See above for
+ * an explanation.
+ */
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
}
}
@@ -2234,7 +3442,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
* scan direction to deal with NULLs. We'll account for that separately.)
*/
Assert(_bt_tuple_before_array_skeys(scan, dir, tuple, tupdesc, tupnatts,
- false, 0, NULL) ==
+ false, 0, NULL, true) ==
!all_required_satisfied);
/*
@@ -2259,7 +3467,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
if (!all_required_satisfied && pstate->finaltup &&
_bt_tuple_before_array_skeys(scan, dir, pstate->finaltup, tupdesc,
BTreeTupleGetNAtts(pstate->finaltup, rel),
- false, 0, &so->scanBehind))
+ false, 0, &so->scanBehind, true))
goto new_prim_scan;
/*
@@ -2460,10 +3668,12 @@ end_toplevel_scan:
/*
* _bt_preprocess_keys() -- Preprocess scan keys
*
+ * The first call here (per btrescan) allocates so->keyData[].
* The given search-type keys (taken from scan->keyData[])
* are copied to so->keyData[] with possible transformation.
* scan->numberOfKeys is the number of input keys, so->numberOfKeys gets
- * the number of output keys (possibly less, never greater).
+ * the number of output keys. Calling here a second or subsequent time
+ * (during the same btrescan) is a no-op.
*
* The output keys are marked with additional sk_flags bits beyond the
* system-standard bits supplied by the caller. The DESC and NULLS_FIRST
@@ -2483,6 +3693,8 @@ end_toplevel_scan:
* within each attribute may be done as a byproduct of the processing here.
* That process must leave array scan keys (within an attribute) in the same
* order as corresponding entries from the scan's BTArrayKeyInfo array info.
+ * We might also cons up skip array scan keys that weren't present in the
+ * original input keys; these are also output in standard attribute order.
*
* The output keys are marked with flags SK_BT_REQFWD and/or SK_BT_REQBKWD
* if they must be satisfied in order to continue the scan forward or backward
@@ -2550,9 +3762,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
int16 *indoption = scan->indexRelation->rd_indoption;
int new_numberOfKeys;
int numberOfEqualCols;
- ScanKey inkeys;
- ScanKey outkeys;
- ScanKey cur;
+ ScanKey inputsk;
BTScanKeyPreproc xform[BTMaxStrategyNumber];
bool test_result;
int i,
@@ -2584,7 +3794,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
return; /* done if qual-less scan */
/* If any keys are SK_SEARCHARRAY type, set up array-key info */
- arrayKeyData = _bt_preprocess_array_keys(scan);
+ arrayKeyData = _bt_preprocess_array_keys(scan, &numberOfKeys);
if (!so->qual_ok)
{
/* unmatchable array, so give up */
@@ -2598,32 +3808,36 @@ _bt_preprocess_keys(IndexScanDesc scan)
*/
if (arrayKeyData)
{
- inkeys = arrayKeyData;
+ inputsk = arrayKeyData;
/* Also maintain keyDataMap for remapping so->orderProc[] later */
keyDataMap = MemoryContextAlloc(so->arrayContext,
numberOfKeys * sizeof(int));
}
else
- inkeys = scan->keyData;
+ inputsk = scan->keyData;
+
+ /*
+ * Now that we have an estimate of the number of output scan keys
+ * (including any skip array scan keys), allocate space for them
+ */
+ so->keyData = palloc(sizeof(ScanKeyData) * numberOfKeys);
- outkeys = so->keyData;
- cur = &inkeys[0];
/* we check that input keys are correctly ordered */
- if (cur->sk_attno < 1)
+ if (inputsk->sk_attno < 1)
elog(ERROR, "btree index keys must be ordered by attribute");
/* We can short-circuit most of the work if there's just one key */
if (numberOfKeys == 1)
{
/* Apply indoption to scankey (might change sk_strategy!) */
- if (!_bt_fix_scankey_strategy(cur, indoption))
+ if (!_bt_fix_scankey_strategy(inputsk, indoption))
so->qual_ok = false;
- memcpy(outkeys, cur, sizeof(ScanKeyData));
+ memcpy(&so->keyData[0], inputsk, sizeof(ScanKeyData));
so->numberOfKeys = 1;
/* We can mark the qual as required if it's for first index col */
- if (cur->sk_attno == 1)
- _bt_mark_scankey_required(outkeys);
+ if (inputsk->sk_attno == 1)
+ _bt_mark_scankey_required(&so->keyData[0]);
if (arrayKeyData)
{
/*
@@ -2631,8 +3845,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
* (we'll miss out on the single value array transformation, but
* that's not nearly as important when there's only one scan key)
*/
- Assert(cur->sk_flags & SK_SEARCHARRAY);
- Assert(cur->sk_strategy != BTEqualStrategyNumber ||
+ Assert(so->keyData[0].sk_flags & SK_SEARCHARRAY);
+ Assert(so->keyData[0].sk_strategy != BTEqualStrategyNumber ||
(so->arrayKeys[0].scan_key == 0 &&
OidIsValid(so->orderProcs[0].fn_oid)));
}
@@ -2660,12 +3874,12 @@ _bt_preprocess_keys(IndexScanDesc scan)
* handle after-last-key processing. Actual exit from the loop is at the
* "break" statement below.
*/
- for (i = 0;; cur++, i++)
+ for (i = 0;; inputsk++, i++)
{
if (i < numberOfKeys)
{
/* Apply indoption to scankey (might change sk_strategy!) */
- if (!_bt_fix_scankey_strategy(cur, indoption))
+ if (!_bt_fix_scankey_strategy(inputsk, indoption))
{
/* NULL can't be matched, so give up */
so->qual_ok = false;
@@ -2677,12 +3891,12 @@ _bt_preprocess_keys(IndexScanDesc scan)
* If we are at the end of the keys for a particular attr, finish up
* processing and emit the cleaned-up keys.
*/
- if (i == numberOfKeys || cur->sk_attno != attno)
+ if (i == numberOfKeys || inputsk->sk_attno != attno)
{
int priorNumberOfEqualCols = numberOfEqualCols;
/* check input keys are correctly ordered */
- if (i < numberOfKeys && cur->sk_attno < attno)
+ if (i < numberOfKeys && inputsk->sk_attno < attno)
elog(ERROR, "btree index keys must be ordered by attribute");
/*
@@ -2741,7 +3955,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
return;
}
/* else discard the redundant non-equality key */
- Assert(!array || array->num_elems > 0);
+ Assert(!array || array->num_elems > 0 ||
+ array->num_elems == -1);
xform[j].skey = NULL;
xform[j].ikey = -1;
}
@@ -2786,7 +4001,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
}
/*
- * Emit the cleaned-up keys into the outkeys[] array, and then
+ * Emit the cleaned-up keys into the so->keyData[] array, and then
* mark them if they are required. They are required (possibly
* only in one direction) if all attrs before this one had "=".
*/
@@ -2794,7 +4009,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
{
if (xform[j].skey)
{
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
memcpy(outkey, xform[j].skey, sizeof(ScanKeyData));
if (arrayKeyData)
@@ -2811,19 +4026,19 @@ _bt_preprocess_keys(IndexScanDesc scan)
break;
/* Re-initialize for new attno */
- attno = cur->sk_attno;
+ attno = inputsk->sk_attno;
memset(xform, 0, sizeof(xform));
}
/* check strategy this key's operator corresponds to */
- j = cur->sk_strategy - 1;
+ j = inputsk->sk_strategy - 1;
/* if row comparison, push it directly to the output array */
- if (cur->sk_flags & SK_ROW_HEADER)
+ if (inputsk->sk_flags & SK_ROW_HEADER)
{
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
- memcpy(outkey, cur, sizeof(ScanKeyData));
+ memcpy(outkey, inputsk, sizeof(ScanKeyData));
if (arrayKeyData)
keyDataMap[new_numberOfKeys - 1] = i;
if (numberOfEqualCols == attno - 1)
@@ -2837,19 +4052,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
continue;
}
- /*
- * Does this input scan key require further processing as an array?
- */
- if (cur->sk_strategy == InvalidStrategy)
- {
- /* _bt_preprocess_array_keys marked this array key redundant */
- Assert(arrayKeyData);
- Assert(cur->sk_flags & SK_SEARCHARRAY);
- continue;
- }
-
- if (cur->sk_strategy == BTEqualStrategyNumber &&
- (cur->sk_flags & SK_SEARCHARRAY))
+ if (inputsk->sk_strategy == BTEqualStrategyNumber &&
+ (inputsk->sk_flags & SK_SEARCHARRAY))
{
/* _bt_preprocess_array_keys kept this array key */
Assert(arrayKeyData);
@@ -2863,7 +4067,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (xform[j].skey == NULL)
{
/* nope, so this scan key wins by default (at least for now) */
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -2881,7 +4085,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
/*
* Have to set up array keys
*/
- if ((cur->sk_flags & SK_SEARCHARRAY))
+ if (inputsk->sk_flags & SK_SEARCHARRAY)
{
array = &so->arrayKeys[arrayidx - 1];
orderproc = so->orderProcs + i;
@@ -2909,13 +4113,14 @@ _bt_preprocess_keys(IndexScanDesc scan)
*/
}
- if (_bt_compare_scankey_args(scan, cur, cur, xform[j].skey,
+ if (_bt_compare_scankey_args(scan, inputsk, inputsk, xform[j].skey,
array, orderproc, &test_result))
{
/* Have all we need to determine redundancy */
if (test_result)
{
- Assert(!array || array->num_elems > 0);
+ Assert(!array || array->num_elems > 0 ||
+ array->num_elems == -1);
/*
* New key is more restrictive, and so replaces old key...
@@ -2923,7 +4128,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (j != (BTEqualStrategyNumber - 1) ||
!(xform[j].skey->sk_flags & SK_SEARCHARRAY))
{
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -2936,7 +4141,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
* scan key. _bt_compare_scankey_args expects us to
* always keep arrays (and discard non-arrays).
*/
- Assert(!(cur->sk_flags & SK_SEARCHARRAY));
+ Assert(!(inputsk->sk_flags & SK_SEARCHARRAY));
}
}
else if (j == (BTEqualStrategyNumber - 1))
@@ -2959,14 +4164,14 @@ _bt_preprocess_keys(IndexScanDesc scan)
* even with incomplete opfamilies. _bt_advance_array_keys
* depends on this.
*/
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
memcpy(outkey, xform[j].skey, sizeof(ScanKeyData));
if (arrayKeyData)
keyDataMap[new_numberOfKeys - 1] = xform[j].ikey;
if (numberOfEqualCols == attno - 1)
_bt_mark_scankey_required(outkey);
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -3057,10 +4262,11 @@ _bt_verify_keys_with_arraykeys(IndexScanDesc scan)
if (array->scan_key != ikey)
return false;
- if (array->num_elems <= 0)
+ if (array->num_elems == 0 || array->num_elems < -1)
return false;
- if (cur->sk_argument != array->elem_values[array->cur_elem])
+ if (array->num_elems != -1 &&
+ cur->sk_argument != array->elem_values[array->cur_elem])
return false;
if (last_sk_attno > cur->sk_attno)
return false;
@@ -3135,6 +4341,22 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool leftnull,
rightnull;
+ /* Handle skip array comparison with IS NOT NULL scan key */
+ if ((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP)
+ {
+ /* Shouldn't generate skip array in presence of IS NULL key */
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_SEARCHNULL));
+ Assert((leftarg->sk_flags | rightarg->sk_flags) & SK_SEARCHNOTNULL);
+
+ /* Skip array will have no NULL element/IS NULL scan key */
+ Assert(array->num_elems == -1);
+ array->null_elem = false;
+
+ /* IS NOT NULL key (could be leftarg or rightarg) now redundant */
+ *result = true;
+ return true;
+ }
+
if (leftarg->sk_flags & SK_ISNULL)
{
Assert(leftarg->sk_flags & (SK_SEARCHNULL | SK_SEARCHNOTNULL));
@@ -3208,6 +4430,7 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
{
/* Can't make the comparison */
*result = false; /* suppress compiler warnings */
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP));
return false;
}
@@ -3380,13 +4603,6 @@ _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption)
return true;
}
- if (skey->sk_strategy == InvalidStrategy)
- {
- /* Already-eliminated array scan key; don't need to fix anything */
- Assert(skey->sk_flags & SK_SEARCHARRAY);
- return true;
- }
-
/* Adjust strategy for DESC, if we didn't already */
if ((addflags & SK_BT_DESC) && !(skey->sk_flags & SK_BT_DESC))
skey->sk_strategy = BTCommuteStrategyNumber(skey->sk_strategy);
@@ -3524,7 +4740,7 @@ _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, bool arrayKeys,
*/
Assert(!so->scanBehind && !pstate->prechecked && !pstate->firstmatch);
Assert(!_bt_tuple_before_array_skeys(scan, dir, tuple, tupdesc,
- tupnatts, false, 0, NULL));
+ tupnatts, false, 0, NULL, false));
}
if (pstate->prechecked || pstate->firstmatch)
{
@@ -3560,7 +4776,7 @@ _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, bool arrayKeys,
* tuples matching the current set of array keys. Check for that first.
*/
if (_bt_tuple_before_array_skeys(scan, dir, tuple, tupdesc, tupnatts, true,
- ikey, NULL))
+ ikey, NULL, false))
{
/*
* Tuple is still before the start of matches according to the scan's
@@ -3579,7 +4795,7 @@ _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, bool arrayKeys,
_bt_tuple_before_array_skeys(scan, dir, pstate->finaltup, tupdesc,
BTreeTupleGetNAtts(pstate->finaltup,
scan->indexRelation),
- false, 0, NULL))
+ false, 0, NULL, false))
{
/* Cut our losses -- start a new primitive index scan now */
pstate->continuescan = false;
@@ -3734,6 +4950,21 @@ _bt_check_compare(IndexScanDesc scan, ScanDirection dir,
continue;
}
+ /*
+ * A skip array scan key might be negative/positive infinity. Might
+ * also be next key/previous key sentinel, which we don't deal with.
+ */
+ if (key->sk_flags & (SK_BT_NEG_INF | SK_BT_POS_INF |
+ SK_BT_NEXTKEY | SK_BT_PREVKEY))
+ {
+ Assert(key->sk_flags & SK_SEARCHARRAY);
+ Assert(key->sk_flags & SK_BT_SKIP);
+ Assert(requiredSameDir);
+
+ *continuescan = false;
+ return false;
+ }
+
/* row-comparison keys need special processing */
if (key->sk_flags & SK_ROW_HEADER)
{
@@ -4105,7 +5336,7 @@ _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
ahead = (IndexTuple) PageGetItem(pstate->page,
PageGetItemId(pstate->page, aheadoffnum));
if (_bt_tuple_before_array_skeys(scan, dir, ahead, tupdesc, tupnatts,
- false, 0, NULL))
+ false, 0, NULL, false))
{
/*
* Success -- instruct _bt_readpage to skip ahead to very next tuple
diff --git a/src/backend/access/nbtree/nbtvalidate.c b/src/backend/access/nbtree/nbtvalidate.c
index e9d4cd60d..96d0d9185 100644
--- a/src/backend/access/nbtree/nbtvalidate.c
+++ b/src/backend/access/nbtree/nbtvalidate.c
@@ -114,6 +114,10 @@ btvalidate(Oid opclassoid)
case BTOPTIONS_PROC:
ok = check_amoptsproc_signature(procform->amproc);
break;
+ case BTSKIPSUPPORT_PROC:
+ ok = check_amproc_signature(procform->amproc, VOIDOID, true,
+ 1, 1, INTERNALOID);
+ break;
default:
ereport(INFO,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
diff --git a/src/backend/commands/opclasscmds.c b/src/backend/commands/opclasscmds.c
index b8b5c147c..a86dbf71b 100644
--- a/src/backend/commands/opclasscmds.c
+++ b/src/backend/commands/opclasscmds.c
@@ -1330,6 +1330,31 @@ assignProcTypes(OpFamilyMember *member, Oid amoid, Oid typeoid,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
errmsg("btree equal image functions must not be cross-type")));
}
+ else if (member->number == BTSKIPSUPPORT_PROC)
+ {
+ if (procform->pronargs != 1 ||
+ procform->proargtypes.values[0] != INTERNALOID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must accept type \"internal\"")));
+ if (procform->prorettype != VOIDOID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must return void")));
+
+ /*
+ * pg_amproc functions are indexed by (lefttype, righttype), but a
+ * skip support function doesn't make sense in cross-type
+ * scenarios. The same opclass opcintype OID is always used for
+ * lefttype and righttype. Providing a cross-type routine isn't
+ * sensible. Reject cross-type ALTER OPERATOR FAMILY ... ADD
+ * FUNCTION 6 statements here.
+ */
+ if (member->lefttype != member->righttype)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must not be cross-type")));
+ }
}
else if (amoid == HASH_AM_OID)
{
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index edb09d4e3..e945686c8 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -96,6 +96,7 @@ OBJS = \
rowtypes.o \
ruleutils.o \
selfuncs.o \
+ skipsupport.o \
tid.o \
timestamp.o \
trigfuncs.o \
diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c
index 9c854e0e5..79658f068 100644
--- a/src/backend/utils/adt/date.c
+++ b/src/backend/utils/adt/date.c
@@ -34,6 +34,7 @@
#include "utils/date.h"
#include "utils/datetime.h"
#include "utils/numeric.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
/*
@@ -455,6 +456,49 @@ date_sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+date_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ DateADT dexisting = DatumGetDateADT(existing);
+
+ if (dexisting == DATEVAL_NOBEGIN)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return DateADTGetDatum(dexisting - 1);
+}
+
+static Datum
+date_increment(Relation rel, Datum existing, bool *overflow)
+{
+ DateADT dexisting = DatumGetDateADT(existing);
+
+ if (dexisting == DATEVAL_NOEND)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return DateADTGetDatum(dexisting + 1);
+}
+
+Datum
+date_skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = date_decrement;
+ sksup->increment = date_increment;
+ sksup->low_elem = DateADTGetDatum(DATEVAL_NOBEGIN);
+ sksup->high_elem = DateADTGetDatum(DATEVAL_NOEND);
+
+ PG_RETURN_VOID();
+}
+
Datum
date_finite(PG_FUNCTION_ARGS)
{
diff --git a/src/backend/utils/adt/meson.build b/src/backend/utils/adt/meson.build
index 8c6fc80c3..91682edd5 100644
--- a/src/backend/utils/adt/meson.build
+++ b/src/backend/utils/adt/meson.build
@@ -83,6 +83,7 @@ backend_sources += files(
'rowtypes.c',
'ruleutils.c',
'selfuncs.c',
+ 'skipsupport.c',
'tid.c',
'timestamp.c',
'trigfuncs.c',
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5f5d7959d..33b1722df 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6800,6 +6800,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
List *indexBoundQuals;
int indexcol;
bool eqQualHere;
+ bool found_skip;
bool found_saop;
bool found_is_null_op;
double num_sa_scans;
@@ -6825,6 +6826,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
indexBoundQuals = NIL;
indexcol = 0;
eqQualHere = false;
+ found_skip = false;
found_saop = false;
found_is_null_op = false;
num_sa_scans = 1;
@@ -6833,15 +6835,38 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
IndexClause *iclause = lfirst_node(IndexClause, lc);
ListCell *lc2;
+ /*
+ * XXX For now we just cost skip scans via generic rules: make a
+ * uniform assumption that there will be 10 primitive index scans per
+ * skipped attribute, relying on the "1/3 of all index pages" cap that
+ * this costing has used since Postgres 17. Also assume that skipping
+ * won't take place for an index that has fewer than 100 pages.
+ *
+ * The current approach to costing leaves much to be desired, but is
+ * at least better than nothing at all (keeping the code as it is on
+ * HEAD just makes testing and review inconvenient).
+ */
if (indexcol != iclause->indexcol)
{
/* Beginning of a new column's quals */
if (!eqQualHere)
- break; /* done if no '=' qual for indexcol */
+ {
+ found_skip = true; /* skip when no '=' qual for indexcol */
+ if (index->pages < 100)
+ break;
+ num_sa_scans += 10;
+ }
eqQualHere = false;
indexcol++;
if (indexcol != iclause->indexcol)
- break; /* no quals at all for indexcol */
+ {
+ /* no quals at all for indexcol */
+ found_skip = true;
+ if (index->pages < 100)
+ break;
+ num_sa_scans += 10 * (iclause->indexcol - indexcol);
+ continue;
+ }
}
/* Examine each indexqual associated with this index clause */
@@ -6914,6 +6939,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
if (index->unique &&
indexcol == index->nkeycolumns - 1 &&
eqQualHere &&
+ !found_skip &&
!found_saop &&
!found_is_null_op)
numIndexTuples = 1.0;
diff --git a/src/backend/utils/adt/skipsupport.c b/src/backend/utils/adt/skipsupport.c
new file mode 100644
index 000000000..796e998a9
--- /dev/null
+++ b/src/backend/utils/adt/skipsupport.c
@@ -0,0 +1,52 @@
+/*-------------------------------------------------------------------------
+ *
+ * skipsupport.c
+ * Support routines for B-Tree skip scan.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/adt/skipsupport.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "utils/lsyscache.h"
+#include "utils/skipsupport.h"
+
+/*
+ * Fill in SkipSupport given an operator class (opfamily + opcintype).
+ *
+ * On success, returns true, and initializes all SkipSupport fields for
+ * caller. Otherwise returns false, indicating that operator class has no
+ * skip support function.
+ */
+bool
+PrepareSkipSupportFromOpclass(Oid opfamily, Oid opcintype, bool reverse,
+ SkipSupport sksup)
+{
+ Oid skipSupportFunction;
+
+ /* Look for a skip support function */
+ skipSupportFunction = get_opfamily_proc(opfamily, opcintype, opcintype,
+ BTSKIPSUPPORT_PROC);
+ if (!OidIsValid(skipSupportFunction))
+ return false;
+
+ OidFunctionCall1(skipSupportFunction, PointerGetDatum(sksup));
+
+ if (reverse)
+ {
+ Datum low_elem = sksup->low_elem;
+
+ sksup->low_elem = sksup->high_elem;
+ sksup->high_elem = low_elem;
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/adt/uuid.c b/src/backend/utils/adt/uuid.c
index 45eb1b2fe..e2d98a62f 100644
--- a/src/backend/utils/adt/uuid.c
+++ b/src/backend/utils/adt/uuid.c
@@ -13,12 +13,15 @@
#include "postgres.h"
+#include <limits.h>
+
#include "common/hashfn.h"
#include "lib/hyperloglog.h"
#include "libpq/pqformat.h"
#include "port/pg_bswap.h"
#include "utils/fmgrprotos.h"
#include "utils/guc.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
#include "utils/timestamp.h"
#include "utils/uuid.h"
@@ -390,6 +393,70 @@ uuid_abbrev_convert(Datum original, SortSupport ssup)
return res;
}
+static Datum
+uuid_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ pg_uuid_t *uuid;
+
+ uuid = (pg_uuid_t *) palloc(UUID_LEN);
+ memcpy(uuid, DatumGetUUIDP(existing), UUID_LEN);
+ *underflow = false;
+ for (int i = UUID_LEN - 1; i >= 0; i--)
+ {
+ if (uuid->data[i] > 0)
+ {
+ uuid->data[i]--;
+ return UUIDPGetDatum(uuid);
+ }
+ uuid->data[i] = UCHAR_MAX;
+ }
+
+ *underflow = true;
+
+ return 0;
+}
+
+static Datum
+uuid_increment(Relation rel, Datum existing, bool *overflow)
+{
+ pg_uuid_t *uuid;
+
+ uuid = (pg_uuid_t *) palloc(UUID_LEN);
+ memcpy(uuid, DatumGetUUIDP(existing), UUID_LEN);
+ *overflow = false;
+ for (int i = UUID_LEN - 1; i >= 0; i--)
+ {
+ if (uuid->data[i] < UCHAR_MAX)
+ {
+ uuid->data[i]++;
+ return UUIDPGetDatum(uuid);
+ }
+ uuid->data[i] = 0;
+ }
+
+ *overflow = true;
+
+ return 0;
+}
+
+Datum
+uuid_skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+ pg_uuid_t *uuid_min = palloc(UUID_LEN);
+ pg_uuid_t *uuid_max = palloc(UUID_LEN);
+
+ memset(uuid_min->data, 0x00, UUID_LEN);
+ memset(uuid_max->data, 0xFF, UUID_LEN);
+
+ sksup->decrement = uuid_decrement;
+ sksup->increment = uuid_increment;
+ sksup->low_elem = UUIDPGetDatum(uuid_min);
+ sksup->high_elem = UUIDPGetDatum(uuid_max);
+
+ PG_RETURN_VOID();
+}
+
/* hash index support */
Datum
uuid_hash(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 630ed0f16..6fc3ca1a7 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -28,6 +28,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/nbtree.h"
#include "access/slru.h"
#include "access/toast_compression.h"
#include "access/twophase.h"
@@ -1702,6 +1703,17 @@ struct config_bool ConfigureNamesBool[] =
},
#endif
+ /* XXX Remove before commit */
+ {
+ {"skipscan_skipsupport_enabled", PGC_SUSET, DEVELOPER_OPTIONS,
+ NULL, NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &skipscan_skipsupport_enabled,
+ true,
+ NULL, NULL, NULL
+ },
+
{
{"integer_datetimes", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows whether datetimes are integer based."),
@@ -3525,6 +3537,17 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ /* XXX Remove before commit */
+ {
+ {"skipscan_prefix_cols", PGC_SUSET, DEVELOPER_OPTIONS,
+ NULL, NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &skipscan_prefix_cols,
+ INDEX_MAX_KEYS, 0, INDEX_MAX_KEYS,
+ NULL, NULL, NULL
+ },
+
{
/* Can't be set in postgresql.conf */
{"server_version_num", PGC_INTERNAL, PRESET_OPTIONS,
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 2b3997988..9662fb2ba 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -583,6 +583,19 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
</para>
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><function>skipsupport</function></term>
+ <listitem>
+ <para>
+ Optionally, a btree operator family may provide a <firstterm>skip
+ support</firstterm> function, registered under support function
+ number 6. These functions allow the B-tree code to more efficiently
+ navigate the index structure via an index <quote>skip scan</quote>. The
+ APIs involved in this are defined in
+ <filename>src/include/utils/skipsupport.h</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</sect2>
diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index 22d8ad1aa..f17dd3456 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -461,6 +461,13 @@
</entry>
<entry>5</entry>
</row>
+ <row>
+ <entry>
+ Return the addresses of C-callable skip support function(s)
+ (optional)
+ </entry>
+ <entry>6</entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -1056,7 +1063,8 @@ DEFAULT FOR TYPE int8 USING btree FAMILY integer_ops AS
FUNCTION 1 btint8cmp(int8, int8) ,
FUNCTION 2 btint8sortsupport(internal) ,
FUNCTION 3 in_range(int8, int8, int8, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint8skipsupport(internal);
CREATE OPERATOR CLASS int4_ops
DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
@@ -1069,7 +1077,8 @@ DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
FUNCTION 1 btint4cmp(int4, int4) ,
FUNCTION 2 btint4sortsupport(internal) ,
FUNCTION 3 in_range(int4, int4, int4, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint4skipsupport(internal);
CREATE OPERATOR CLASS int2_ops
DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
@@ -1082,7 +1091,8 @@ DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
FUNCTION 1 btint2cmp(int2, int2) ,
FUNCTION 2 btint2sortsupport(internal) ,
FUNCTION 3 in_range(int2, int2, int2, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint2skipsupport(internal);
ALTER OPERATOR FAMILY integer_ops USING btree ADD
-- cross-type comparisons int8 vs int2
diff --git a/src/test/regress/expected/alter_generic.out b/src/test/regress/expected/alter_generic.out
index ae54cb254..8b6b775c1 100644
--- a/src/test/regress/expected/alter_generic.out
+++ b/src/test/regress/expected/alter_generic.out
@@ -362,9 +362,9 @@ ERROR: invalid operator number 0, must be between 1 and 5
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 1 < ; -- operator without argument types
ERROR: operator argument types must be specified in ALTER OPERATOR FAMILY
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 0 btint42cmp(int4, int2); -- invalid options parsing function
-ERROR: invalid function number 0, must be between 1 and 5
-ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 6 btint42cmp(int4, int2); -- function number should be between 1 and 5
-ERROR: invalid function number 6, must be between 1 and 5
+ERROR: invalid function number 0, must be between 1 and 6
+ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 7 btint42cmp(int4, int2); -- function number should be between 1 and 6
+ERROR: invalid function number 7, must be between 1 and 6
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD STORAGE invalid_storage; -- Ensure STORAGE is not a part of ALTER OPERATOR FAMILY
ERROR: STORAGE cannot be specified in ALTER OPERATOR FAMILY
DROP OPERATOR FAMILY alt_opf4 USING btree;
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 3bbe4c5f9..a8d5be6c1 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5138,9 +5138,10 @@ List of access methods
btree | uuid_ops | uuid | uuid | 1 | uuid_cmp
btree | uuid_ops | uuid | uuid | 2 | uuid_sortsupport
btree | uuid_ops | uuid | uuid | 4 | btequalimage
+ btree | uuid_ops | uuid | uuid | 6 | uuid_skipsupport
hash | uuid_ops | uuid | uuid | 1 | uuid_hash
hash | uuid_ops | uuid | uuid | 2 | uuid_hash_extended
-(5 rows)
+(6 rows)
-- check \dconfig
set work_mem = 10240;
diff --git a/src/test/regress/sql/alter_generic.sql b/src/test/regress/sql/alter_generic.sql
index de58d268d..4246afefd 100644
--- a/src/test/regress/sql/alter_generic.sql
+++ b/src/test/regress/sql/alter_generic.sql
@@ -310,7 +310,7 @@ ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 6 < (int4, int2); -- ope
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 0 < (int4, int2); -- operator number should be between 1 and 5
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 1 < ; -- operator without argument types
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 0 btint42cmp(int4, int2); -- invalid options parsing function
-ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 6 btint42cmp(int4, int2); -- function number should be between 1 and 5
+ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 7 btint42cmp(int4, int2); -- function number should be between 1 and 6
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD STORAGE invalid_storage; -- Ensure STORAGE is not a part of ALTER OPERATOR FAMILY
DROP OPERATOR FAMILY alt_opf4 USING btree;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b4d7f9217..b5b5c2494 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -218,6 +218,7 @@ BTScanPos
BTScanPosData
BTScanPosItem
BTShared
+BTSkipPreproc
BTSortArrayContext
BTSpool
BTStack
@@ -2654,6 +2655,8 @@ SingleBoundSortItem
SinglePartitionSpec
Size
SkipPages
+SkipSupport
+SkipSupportData
SlabBlock
SlabContext
SlabSlot
--
2.45.2
On Wed, Jun 26, 2024 at 03:16:07PM GMT, Peter Geoghegan wrote:
Loose index scan is a far more specialized technique than skip scan.
It only applies within special scans that feed into a DISTINCT group
aggregate. Whereas my skip scan patch isn't like that at all -- it's
much more general. With my patch, nbtree has exactly the same contract
with the executor/core code as before. There are no new index paths
generated by the optimizer to make skip scan work, even. Skip scan
isn't particularly aimed at improving group aggregates (though the
benchmark I'll show happens to involve a group aggregate, simply
because the technique works best with large and expensive index
scans).
I see that the patch is not supposed to deal with aggregates in any special
way. But from what I understand after a quick review, skip scan is not getting
applied to them if there are no quals in the query (in that case
_bt_preprocess_keys returns before calling _bt_preprocess_array_keys). Yet such
queries could benefit from skipping, I assume they still could be handled by
the machinery introduced in this patch?
Currently, there is an assumption that "there will be 10 primitive index scans
per skipped attribute". Is any chance to use pg_stats.n_distinct?
It probably makes sense to use pg_stats.n_distinct here. But how?
If the problem is that we're too pessimistic, then I think that this
will usually (though not always) make us more pessimistic. Isn't that
the wrong direction to go in? (We're probably also too optimistic in
some cases, but being too pessimistic is a bigger problem in
practice.)
For example, your test case involved 11 distinct values in each
column. The current approach of hard-coding 10 (which is just a
temporary hack) should actually make the scan look a bit cheaper than
it would if we used the true ndistinct.
Another underlying problem is that the existing SAOP costing really
isn't very accurate, without skip scan -- that's a big source of the
pessimism with arrays/skipping. Why should we be able to get the true
number of primitive index scans just by multiplying together each
omitted prefix column's ndistinct? That approach is good for getting
the worst case, which is probably relevant -- but it's probably not a
very good assumption for the average case. (Though at least we can cap
the total number of primitive index scans to 1/3 of the total number
of pages in the index in btcostestimate, since we have guarantees
about the worst case as of Postgres 17.)
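To make the worst-case arithmetic above concrete, here is a minimal sketch. It is not PostgreSQL code: worst_case_prim_scans() and its parameters are hypothetical names, and the only rules it encodes are the two stated in the quoted text (multiply together each omitted prefix column's ndistinct, then cap the result at 1/3 of the total number of index pages).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only (not part of btcostestimate).
 *
 * ndistinct[]  assumed distinct-value count for each omitted prefix column
 * ncols        number of omitted prefix columns
 * index_pages  total number of pages in the index
 *
 * Returns a worst-case estimate of the number of primitive index scans.
 */
double
worst_case_prim_scans(const double *ndistinct, size_t ncols,
					  double index_pages)
{
	double		scans = 1.0;
	double		cap = index_pages / 3.0;

	assert(index_pages >= 0.0);

	/* worst case: every combination of prefix values needs its own scan */
	for (size_t i = 0; i < ncols; i++)
	{
		assert(ndistinct[i] >= 1.0);
		scans *= ndistinct[i];
	}

	/* cap at one third of the index's pages */
	if (scans > cap)
		scans = cap;

	/* there is always at least one primitive index scan */
	if (scans < 1.0)
		scans = 1.0;

	return scans;
}
```

With two skipped columns of 11 distinct values each, this gives 121 primitive scans for a large index, which is why a hard-coded 10 per skipped attribute looks cheaper than the true ndistinct would for that test case.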
Do I understand correctly, that the only way how multiplying ndistincts could
produce too pessimistic results is when there is a correlation between distinct
values? Can one benefit from the extended statistics here?
And while we're at it, I think it would be great if the implementation will
allow some level of visibility about the skip scan. From what I see, currently
it's by design impossible for users to tell whether something was skipped or
not. But when it comes to planning and estimates, maybe it's not a bad idea to
let explain analyze show something like "expected number of primitive scans /
actual number of primitive scans".
On Sat, Aug 3, 2024 at 3:34 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
I see that the patch is not supposed to deal with aggregates in any special
way.
Right.
But from what I understand after a quick review, skip scan is not getting
applied to them if there are no quals in the query (in that case
_bt_preprocess_keys returns before calling _bt_preprocess_array_keys).
Right.
Yet such queries could benefit from skipping, I assume they still could be handled by
the machinery introduced in this patch?
I'm not sure.
There are no real changes required inside _bt_advance_array_keys with
this patch -- skip arrays are dealt with in essentially the same way
as conventional arrays (as of Postgres 17). I suspect that loose index
scan would be best implemented using _bt_advance_array_keys. It could
also "plug in" to the existing _bt_advance_array_keys design, I
suppose.
As I touched on already, your loose index scan patch applies
high-level semantic information in a way that is very different to my
skip scan patch. This means that it makes revisions to the index AM
API (if memory serves it adds a callback called amskip to that API).
It also means that loose index scan can actually avoid heap accesses;
loose scans wholly avoid accessing logical rows (in both the index and
the heap) by reasoning that it just isn't necessary to do so at all.
Skipping happens in both data structures. Right?
Obviously, my new skip scan patch cannot possibly reduce the number of
heap page accesses required by a given index scan. Precisely the same
logical rows must be accessed as before. There is no two-way
conversation between the index AM and the table AM about which
rows/row groupings have at least one visible tuple. We're just
navigating through the index more efficiently, without changing any
contract outside of nbtree itself.
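As a toy illustration of "navigating through the index more efficiently", the sketch below models a composite index on (a, b) as a sorted array and answers "b = target_b" by repositioning once per distinct 'a' value instead of reading every tuple. All names here (ToyTuple, toy_descend, toy_skip_scan) are hypothetical, and this is not how the patch is implemented; it only shows that the same matching tuples are returned either way, while far fewer index entries need to be visited.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* One entry in a toy composite index on (a, b), kept sorted by (a, b) */
typedef struct ToyTuple
{
	int			a;
	int			b;
} ToyTuple;

/*
 * Return offset of the first tuple >= (a, b) in index order, or ntuples if
 * there is none.  Stands in for one descent of the tree.
 */
static size_t
toy_descend(const ToyTuple *idx, size_t ntuples, int a, int b)
{
	size_t		lo = 0;
	size_t		hi = ntuples;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (idx[mid].a < a || (idx[mid].a == a && idx[mid].b < b))
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Count tuples with b == target_b, treating each distinct 'a' value as its
 * own "subindex".  Sets *descents to the number of toy_descend calls made.
 */
size_t
toy_skip_scan(const ToyTuple *idx, size_t ntuples, int target_b,
			  size_t *descents)
{
	size_t		matches = 0;
	size_t		pos = 0;

	*descents = 0;
	while (pos < ntuples)
	{
		int			cur_a = idx[pos].a; /* next 'a' value actually present */

		/* behaves like "a = cur_a AND b = target_b" */
		pos = toy_descend(idx, ntuples, cur_a, target_b);
		(*descents)++;
		while (pos < ntuples &&
			   idx[pos].a == cur_a && idx[pos].b == target_b)
		{
			matches++;
			pos++;
		}

		if (cur_a == INT_MAX)
			break;				/* no later 'a' value can exist */

		/* skip the rest of this group: first tuple with a > cur_a */
		pos = toy_descend(idx, ntuples, cur_a + 1, INT_MIN);
		(*descents)++;
	}
	return matches;
}
```

The toy makes two descents per distinct 'a' value for simplicity. The count of matching tuples is identical to what a full scan of the array would produce, which mirrors the point above: precisely the same logical rows are returned, only the route through the index changes.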
The "skip scan" name collision is regrettable. But the fact is that
Oracle, MySQL, and now SQLite all call this feature skip scan. That
feels like the right precedent to follow.
Do I understand correctly, that the only way how multiplying ndistincts could
produce too pessimistic results is when there is a correlation between distinct
values?
Yes, that's one problem with the costing. Not the only one, though.
The true number of primitive index scans depends on the cardinality of
the data. For example, a skip scan might be the cheapest plan by far
if (say) 90% of the index has the same leading column value and the
remaining 10% has totally unique values. We'd still do a bad job of
costing this query with an accurate ndistinct for the leading column.
We really only need to do one or two primitive index scans for "the
first 90% of the index", and one more primitive index scan for "the
remaining 10% of the index". For a query such as this, we "require a
full index scan for the remaining 10% of the index", which is
suboptimal, but doesn't fundamentally change anything (I guess that a
skip scan is always suboptimal, in the sense that you could always do
better by having more indexes).
Can one benefit from the extended statistics here?
I really don't know. Certainly seems possible in cases with more than
one skipped leading column.
The main problem with the costing right now is that it's just not very
well thought through, in general. The performance at runtime depends
on the layout of values in the index itself, so the underlying way
that you'd model the costs doesn't have any great precedent in
costsize.c. We do have some idea of the number of leaf pages we'll
access in btcostestimate(), but that works in a way that isn't really
fit for purpose. It kind of works with one primitive index scan, but
works much less well with multiple primitive scans.
And while we're at it, I think it would be great if the implementation will
allow some level of visibility about the skip scan. From what I see, currently
it's by design impossible for users to tell whether something was skipped or
not. But when it comes to planning and estimates, maybe it's not a bad idea to
let explain analyze show something like "expected number of primitive scans /
actual number of primitive scans".
I agree. I think that that's pretty much mandatory for this patch. At
least the actual number of primitive scans should be exposed. Not
quite as sure about showing the estimated number, since that might be
embarrassingly wrong quite regularly, without it necessarily mattering
that much (I'd worry that it'd be distracting).
Displaying the number of primitive scans would already be useful for
index scans with SAOPs, even without this patch. The same general
concepts (estimated vs. actual primitive index scans) already exist,
as of Postgres 17. That's really nothing new.
--
Peter Geoghegan
On Sat, Aug 3, 2024 at 6:14 PM Peter Geoghegan <pg@bowt.ie> wrote:
Displaying the number of primitive scans would already be useful for
index scans with SAOPs, even without this patch. The same general
concepts (estimated vs. actual primitive index scans) already exist,
as of Postgres 17. That's really nothing new.
We actually expose this via instrumentation, in a certain sense. This
is documented by a "Note":
https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-ALL-INDEXES-VIEW
That is, we already say "Each internal primitive index scan increments
pg_stat_all_indexes.idx_scan, so it's possible for the count of index
scans to significantly exceed the total number of index scan executor
node executions". So, as I said in the last email, advertising the
difference between # of primitive index scans and # of index scan
executor node executions in EXPLAIN ANALYZE is already a good idea.
--
Peter Geoghegan
On Wed, Jul 24, 2024 at 5:14 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v4
Attached is v5, which splits the code from v4 patch into 2 pieces --
it becomes 0002-* and 0003-*. Certain refactoring work now appears
under its own separate patch/commit -- see 0002-* (nothing new here,
except the commit message/patch structure). The patch that actually
adds skip scan (0003-* in this new version) has been further polished,
though not in a way that I think is interesting enough to go into
here.
The interesting and notable change for v5 is the addition of the code
in 0001-*. The new 0001-* patch is concerned with certain aspects of
how _bt_advance_array_keys decides whether to start another primitive
index scan (or to stick with the ongoing one for one more leaf page
instead). This is a behavioral change, albeit a subtle one. It's also
kinda independent of skip scan (more on why that is at the end).
It's easiest to explain why 0001-* matters by way of an example. My
example will show significantly more internal/root page accesses than
seen on master, though only when 0002-* and 0003-* are applied, and
0001-* is omitted. When all 3 v5 patches are applied together, the
total number of index pages accessed by the test query will match the
master branch. It's important that skip scan never loses by much to
the master branch, of course. Even when the details of the index/scan
are inconvenient to the implementation, in whatever way.
Setup:
create table demo (a int4, b numeric);
create index demo_idx on demo (a, b);
insert into demo select a, random() from generate_series(1, 10000) a,
generate_series(1,5) five_rows_per_a_val;
vacuum demo;
We now have a btree index "demo_idx", which has two levels (a root
page plus a leaf level). The root page contains several hundred pivot
tuples, all of which have their "b" value truncated away (or have the
value -inf, if you prefer), with just one prefix "a" column left in
place. Naturally, every leaf page has a high key with its own
separator key that matches one particular tuple that appears in the
root page (except for the rightmost leaf page). So our leaf level scan
will see lots of truncated leaf page high keys (all matching a
corresponding root page tuple).
Test query:
select a from demo where b > 0.99;
This is a query that really shouldn't be doing any skipping at all. We
nevertheless still see a huge amount of skipping with this query, once
0001-* is omitted. Prior to 0001-*, a new primitive index scan is
started whenever the scan reaches a "boundary" between adjoining leaf
pages. That is, whenever _bt_advance_array_keys stopped on a high key
pstate.finaltup. So without the new 0001-* work, the number of page
accesses almost doubles (because we access the root page once per leaf
page accessed, instead of just accessing it once for the whole scan).
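The "almost doubles" arithmetic can be sketched with a toy model of the two-level index described above (one root page plus a leaf level). The function name and parameters are hypothetical, and the model ignores everything except root and leaf page reads.

```c
#include <assert.h>

/*
 * Toy model of index page accesses for a scan that reads every leaf page of
 * a two-level B-tree (root page plus leaf_pages leaf pages).
 *
 * new_prim_scan_per_leaf != 0 models the behavior without 0001-*: a new
 * primitive index scan (and so a new root page access) at every leaf page
 * boundary.  new_prim_scan_per_leaf == 0 models stepping right to the next
 * sibling leaf page after a single initial descent.
 */
long
toy_pages_accessed(long leaf_pages, int new_prim_scan_per_leaf)
{
	assert(leaf_pages >= 1);

	if (new_prim_scan_per_leaf)
		return 2 * leaf_pages;	/* one root access per leaf page accessed */

	return 1 + leaf_pages;		/* one root access for the whole scan */
}
```

For an assumed 300 leaf pages, that is 600 page accesses versus 301, so the overhead approaches a factor of two as the leaf level grows.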
What skip scan should have been doing all along (and will do now) is
to step forward to the next right sibling leaf page whenever it
reaches a boundary between leaf pages. This should happen again and
again, without our ever choosing to start a new primitive index scan
instead (it shouldn't happen even once with this query). In other
words, we ought to behave just like a full index scan would behave
with this query -- which is exactly what we get on master.
The scan will still nominally "use skip scan" even with this fix in
place, but in practice, for this particular query/index, the scan
won't ever actually decide to skip. So it at least "looks like" an
index scan from the point of view of EXPLAIN (ANALYZE, BUFFERS). There
is a separate question of how many CPU cycles we use to do all this,
but for now my focus is on total pages accessed by the patch versus on
master, especially for adversarial cases such as this.
It should be noted that the skip scan patch never had any problems
with this very similar query (same table as before):
select a from demo where b < 0.01;
The fact that we did the wrong thing for the first query, but the
right thing for this second similar query, was solely due to certain
accidental implementation details -- it had nothing to do with the
fundamentals of the problem. You might even say that 0001-* makes the
original "b > 0.99" case behave in the same manner as this similar "b
< 0.01" case, which is justifiable on consistency grounds. Why
wouldn't these two cases behave similarly? It's only logical.
The underlying problem arguably has little to do with skip scan;
whether we use a real SAOP array on "a" or a consed up skip array is
incidental to the problem that my example highlights. As always, the
underlying "array type" (skip vs SAOP) only matters to the lowest
level code. And so technically, this is an existing issue on
HEAD/master. You can see that for yourself by making the problematic
query's qual "where a = any ('every possible a value') and b > 0.99"
-- same problem on Postgres 17, without involving skip scan.
To be sure, the underlying problem does become more practically
relevant with the invention of skip arrays for skip scan, but 0001-*
can still be treated as independent work. It can be committed well
ahead of the other stuff IMV. The same is likely also true of the
refactoring now done in 0002-* -- it does refactoring that makes
sense, even without skip scan. And so I don't expect it to take all
that long for it to be committable.
--
Peter Geoghegan
Attachments:
v5-0001-Normalize-nbtree-truncated-high-key-array-behavio.patch (application/octet-stream)
From 8c8a3c36daa9c3f69aab6024d0d44e71451fae3b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 8 Aug 2024 13:51:18 -0400
Subject: [PATCH v5 1/3] Normalize nbtree truncated high key array behavior.
Commit 5bf748b8 taught nbtree ScalarArrayOp array processing to decide
when and how to start the next primitive index scan based on physical
index characteristics. This included rules for deciding whether to
start a new primitive index scan (or whether to move onto the right
sibling leaf page instead) whenever the scan encounters a leaf high key
with truncated lower-order columns whose omitted/-inf values are covered
by one or more arrays.
Prior to this commit, nbtree would treat a truncated column as
satisfying a scan key that was marked required in the current scan
direction. It would just give up and start a new primitive index scan
in cases involving inequalities required in the opposite direction only
(in practice this meant > and >= strategy scan keys, since only forward
scans consider the page high key like this).
Bring > and >= strategy scan keys in line with other required scan key
types: have nbtree persist with its current primitive index scan
regardless of the operator strategy in use. This requires scheduling
and then performing an explicit check of the next page's high key (if
any) at the point that _bt_readpage is next called.
Although this could be considered a stand alone piece of work, it's
mostly intended as preparation for an upcoming patch that adds skip scan
optimizations to nbtree. Without this work there are cases where the
scan's skip arrays trigger an excessive number of primitive index scans
due to most high keys having a truncated attribute that was previously
treated as not satisfying a required > or >= strategy scan key.
---
src/include/access/nbtree.h | 3 +
src/backend/access/nbtree/nbtree.c | 4 +
src/backend/access/nbtree/nbtsearch.c | 24 ++++++
src/backend/access/nbtree/nbtutils.c | 119 ++++++++++++++------------
4 files changed, 97 insertions(+), 53 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 749304334..5f366323c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1048,6 +1048,7 @@ typedef struct BTScanOpaqueData
int numArrayKeys; /* number of equality-type array keys */
bool needPrimScan; /* New prim scan to continue in current dir? */
bool scanBehind; /* Last array advancement matched -inf attr? */
+ bool oppoDirCheck; /* check opposite dir scan keys? */
BTArrayKeyInfo *arrayKeys; /* info about each equality-type array key */
FmgrInfo *orderProcs; /* ORDER procs for required equality keys */
MemoryContext arrayContext; /* scan-lifespan context for array data */
@@ -1291,6 +1292,8 @@ extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_preprocess_keys(IndexScanDesc scan);
extern bool _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, bool arrayKeys,
IndexTuple tuple, int tupnatts);
+extern bool _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
+ IndexTuple finaltup);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 686a3206f..e5ce129cc 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -331,6 +331,7 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->needPrimScan = false;
so->scanBehind = false;
+ so->oppoDirCheck = false;
so->arrayKeys = NULL;
so->orderProcs = NULL;
so->arrayContext = NULL;
@@ -374,6 +375,7 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
so->markItemIndex = -1;
so->needPrimScan = false;
so->scanBehind = false;
+ so->oppoDirCheck = false;
BTScanPosUnpinIfPinned(so->markPos);
BTScanPosInvalidate(so->markPos);
@@ -626,6 +628,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
*/
so->needPrimScan = false;
so->scanBehind = false;
+ so->oppoDirCheck = false;
}
else
{
@@ -670,6 +673,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
}
so->needPrimScan = true;
so->scanBehind = false;
+ so->oppoDirCheck = false;
*pageno = InvalidBlockNumber;
exit_loop = true;
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 57bcfc7e4..88f4ef7b7 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1679,6 +1679,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
ItemId iid;
IndexTuple itup;
+ Assert(!so->oppoDirCheck);
+
iid = PageGetItemId(page, ScanDirectionIsForward(dir) ? maxoff : minoff);
itup = (IndexTuple) PageGetItem(page, iid);
@@ -1696,6 +1698,28 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
ItemId iid = PageGetItemId(page, P_HIKEY);
pstate.finaltup = (IndexTuple) PageGetItem(page, iid);
+
+ if (unlikely(so->oppoDirCheck))
+ {
+ /*
+ * Last _bt_readpage call scheduled precheck of finaltup for
+ * required scan keys up to and including a > or >= scan key
+ * (necessary because > and >= are only generally considered
+ * required when scanning backwards)
+ */
+ Assert(so->scanBehind);
+ so->oppoDirCheck = false;
+ if (!_bt_oppodir_checkkeys(scan, dir, pstate.finaltup))
+ {
+ /*
+ * Back out of continuing with this leaf page -- schedule
+ * another primitive index scan after all
+ */
+ so->currPos.moreRight = false;
+ so->needPrimScan = true;
+ return false;
+ }
+ }
}
/* load items[] in ascending order */
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index d6de2072d..1b39d8701 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1362,7 +1362,7 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
curArrayKey->cur_elem = 0;
skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
}
- so->scanBehind = false;
+ so->scanBehind = so->oppoDirCheck = false; /* reset */
}
/*
@@ -1671,8 +1671,7 @@ _bt_start_prim_scan(IndexScanDesc scan, ScanDirection dir)
Assert(so->numArrayKeys);
- /* scanBehind flag doesn't persist across primitive index scans - reset */
- so->scanBehind = false;
+ so->scanBehind = so->oppoDirCheck = false; /* reset */
/*
* Array keys are advanced within _bt_checkkeys when the scan reaches the
@@ -1808,7 +1807,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
Assert(!_bt_tuple_before_array_skeys(scan, dir, tuple, tupdesc,
tupnatts, false, 0, NULL));
- so->scanBehind = false; /* reset */
+ so->scanBehind = so->oppoDirCheck = false; /* reset */
/*
* Required scan key wasn't satisfied, so required arrays will have to
@@ -2293,19 +2292,27 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
if (so->scanBehind && has_required_opposite_direction_only)
{
/*
- * However, we avoid this behavior whenever the scan involves a scan
+ * However, we do things differently whenever the scan involves a scan
* key required in the opposite direction to the scan only, along with
* a finaltup with at least one truncated attribute that's associated
* with a scan key marked required (required in either direction).
*
* _bt_check_compare simply won't stop the scan for a scan key that's
* marked required in the opposite scan direction only. That leaves
- * us without any reliable way of reconsidering any opposite-direction
+ * us without an automatic way of reconsidering any opposite-direction
* inequalities if it turns out that starting a new primitive index
* scan will allow _bt_first to skip ahead by a great many leaf pages
* (see next section for details of how that works).
+ *
+ * We deal with this by explicitly scheduling a finaltup recheck for
+ * the next page -- we'll call _bt_oppodir_checkkeys for the next
+ * page's finaltup instead. You can think of this as a way of dealing
+ * with this page's finaltup being truncated by checking the next
+ * page's finaltup instead. And you can think of the oppoDirCheck
+ * recheck handling within _bt_readpage as complementing the similar
+ * scanBehind recheck made from within _bt_checkkeys.
*/
- goto new_prim_scan;
+ so->oppoDirCheck = true; /* schedule next page's finaltup recheck */
}
/*
@@ -2343,54 +2350,16 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
* also before the _bt_first-wise start of tuples for our new qual. That
* at least suggests many more skippable pages beyond the current page.
*/
- if (has_required_opposite_direction_only && pstate->finaltup &&
- (all_required_satisfied || oppodir_inequality_sktrig))
+ else if (has_required_opposite_direction_only && pstate->finaltup &&
+ (all_required_satisfied || oppodir_inequality_sktrig) &&
+ unlikely(!_bt_oppodir_checkkeys(scan, dir, pstate->finaltup)))
{
- int nfinaltupatts = BTreeTupleGetNAtts(pstate->finaltup, rel);
- ScanDirection flipped;
- bool continuescanflip;
- int opsktrig;
-
/*
- * We're checking finaltup (which is usually not caller's tuple), so
- * cannot reuse work from caller's earlier _bt_check_compare call.
- *
- * Flip the scan direction when calling _bt_check_compare this time,
- * so that it will set continuescanflip=false when it encounters an
- * inequality required in the opposite scan direction.
+ * Make sure that any non-required arrays are set to the first array
+ * element for the current scan direction
*/
- Assert(!so->scanBehind);
- opsktrig = 0;
- flipped = -dir;
- _bt_check_compare(scan, flipped,
- pstate->finaltup, nfinaltupatts, tupdesc,
- false, false, false,
- &continuescanflip, &opsktrig);
-
- /*
- * Only start a new primitive index scan when finaltup has a required
- * unsatisfied inequality (unsatisfied in the opposite direction)
- */
- Assert(all_required_satisfied != oppodir_inequality_sktrig);
- if (unlikely(!continuescanflip &&
- so->keyData[opsktrig].sk_strategy != BTEqualStrategyNumber))
- {
- /*
- * It's possible for the same inequality to be unsatisfied by both
- * caller's tuple (in scan's direction) and finaltup (in the
- * opposite direction) due to _bt_check_compare's behavior with
- * NULLs
- */
- Assert(opsktrig >= sktrig); /* not opsktrig > sktrig due to NULLs */
-
- /*
- * Make sure that any non-required arrays are set to the first
- * array element for the current scan direction
- */
- _bt_rewind_nonrequired_arrays(scan, dir);
-
- goto new_prim_scan;
- }
+ _bt_rewind_nonrequired_arrays(scan, dir);
+ goto new_prim_scan;
}
/*
@@ -3522,7 +3491,8 @@ _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, bool arrayKeys,
*
* Assert that the scan isn't in danger of becoming confused.
*/
- Assert(!so->scanBehind && !pstate->prechecked && !pstate->firstmatch);
+ Assert(!so->scanBehind && !so->oppoDirCheck);
+ Assert(!pstate->prechecked && !pstate->firstmatch);
Assert(!_bt_tuple_before_array_skeys(scan, dir, tuple, tupdesc,
tupnatts, false, 0, NULL));
}
@@ -3634,6 +3604,49 @@ _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, bool arrayKeys,
ikey, true);
}
+/*
+ * Test whether an indextuple satisfies inequalities required in the opposite
+ * direction only (and lower-order equalities required in either direction).
+ *
+ * scan: index scan descriptor (containing a search-type scankey)
+ * dir: current scan direction (flipped by us to get opposite direction)
+ * finaltup: final index tuple on the page
+ *
+ * Caller's finaltup tuple is the page high key (for forwards scans), or the
+ * first non-pivot tuple (for backwards scans). Called during scans with
+ * required array keys.
+ *
+ * Return true if finaltup satisfies keys, false if not. If the tuple fails to
+ * pass the qual, then caller should start another primitive index scan;
+ * _bt_first can efficiently relocate the scan to a far later leaf page.
+ *
+ * Note: we focus on required-in-opposite-direction scan keys (e.g. for a
+ * required > or >= key, assuming a forwards scan) because _bt_checkkeys() can
+ * always deal with required-in-current-direction scan keys on its own.
+ */
+bool
+_bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
+ IndexTuple finaltup)
+{
+ Relation rel = scan->indexRelation;
+ TupleDesc tupdesc = RelationGetDescr(rel);
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ int nfinaltupatts = BTreeTupleGetNAtts(finaltup, rel);
+ bool continuescan;
+ ScanDirection flipped = -dir;
+ int ikey = 0;
+
+ Assert(so->numArrayKeys);
+
+ _bt_check_compare(scan, flipped, finaltup, nfinaltupatts, tupdesc,
+ false, false, false, &continuescan, &ikey);
+
+ if (!continuescan && so->keyData[ikey].sk_strategy != BTEqualStrategyNumber)
+ return false;
+
+ return true;
+}
+
/*
* Test whether an indextuple satisfies current scan condition.
*
--
2.45.2
v5-0003-Add-skip-scan-to-nbtree.patch (application/octet-stream)
From e4b75773628cd492edd12addeda8cbbac618b749 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 16 Apr 2024 13:21:36 -0400
Subject: [PATCH v5 3/3] Add skip scan to nbtree.
Skip scan allows nbtree index scans to efficiently use a composite index
on (a, b) for queries with a predicate such as "WHERE b = 5".
This is useful in cases where the total number of distinct values in the
column 'a' is reasonably small (think hundreds, possibly thousands).
In effect, a skip scan treats the composite index on (a, b) as if it was
a series of disjunct subindexes -- one subindex per distinct 'a' value.
We exhaustively "search every subindex" using a qual that behaves like
"WHERE a = ANY(<every possible 'a' value>) AND b = 5".
The design of skip scan works by extending the design for arrays
established by commit 5bf748b8. "Skip arrays" generate their array
values procedurally and on-demand, but otherwise work just like arrays
used by SAOPs.
B-Tree operator classes on discrete types can now optionally provide a
skip support routine. This is used to generate the next array element
value by incrementing the current value (or by decrementing, in the case
of backwards scans). When the opclass lacks a skip support routine, we
use sentinel next-key values instead. Adding skip support makes skip
scans more efficient in cases where there is naturally a good chance
that the very next value will find matching tuples. For example, during
an index scan with a leading "sales_date" attribute, there is a decent
chance that a scan that just finished returning tuples matching
"sales_date = '2024-06-01' and id = 5000" will find later tuples
matching "sales_date = '2024-06-02' and id = 5000". It is to our
advantage to skip straight to the relevant "id = 5000" leaf page,
totally avoiding reading earlier "sales_date = '2024-06-02'" leaf pages.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com>
Reviewed-By: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
---
src/include/access/nbtree.h | 25 +-
src/include/catalog/pg_amproc.dat | 16 +
src/include/catalog/pg_proc.dat | 24 +
src/include/utils/skipsupport.h | 109 ++
src/backend/access/nbtree/nbtcompare.c | 261 ++++
src/backend/access/nbtree/nbtsearch.c | 80 +-
src/backend/access/nbtree/nbtutils.c | 1456 +++++++++++++++++--
src/backend/access/nbtree/nbtvalidate.c | 4 +
src/backend/commands/opclasscmds.c | 25 +
src/backend/utils/adt/Makefile | 1 +
src/backend/utils/adt/date.c | 44 +
src/backend/utils/adt/meson.build | 1 +
src/backend/utils/adt/selfuncs.c | 30 +-
src/backend/utils/adt/skipsupport.c | 60 +
src/backend/utils/adt/uuid.c | 67 +
src/backend/utils/misc/guc_tables.c | 23 +
doc/src/sgml/btree.sgml | 13 +
doc/src/sgml/indices.sgml | 40 +-
doc/src/sgml/xindex.sgml | 16 +-
src/test/regress/expected/alter_generic.out | 6 +-
src/test/regress/expected/psql.out | 3 +-
src/test/regress/sql/alter_generic.sql | 2 +-
src/tools/pgindent/typedefs.list | 3 +
23 files changed, 2175 insertions(+), 134 deletions(-)
create mode 100644 src/include/utils/skipsupport.h
create mode 100644 src/backend/utils/adt/skipsupport.c
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 5f366323c..13a650c3d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -24,6 +24,7 @@
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/shm_toc.h"
+#include "utils/skipsupport.h"
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
typedef uint16 BTCycleId;
@@ -709,7 +710,8 @@ BTreeTupleGetMaxHeapTID(IndexTuple itup)
#define BTINRANGE_PROC 3
#define BTEQUALIMAGE_PROC 4
#define BTOPTIONS_PROC 5
-#define BTNProcs 5
+#define BTSKIPSUPPORT_PROC 6
+#define BTNProcs 6
/*
* We need to be able to tell the difference between read and write
@@ -1031,10 +1033,22 @@ typedef BTScanPosData *BTScanPos;
/* We need one of these for each equality-type SK_SEARCHARRAY scan key */
typedef struct BTArrayKeyInfo
{
+ /* fields used by both kinds of array (standard arrays and skip arrays) */
int scan_key; /* index of associated key in keyData */
+ int num_elems; /* number of elems (-1 for skip array) */
+
+ /* fields for standard arrays that store elements in memory */
int cur_elem; /* index of current element in elem_values */
- int num_elems; /* number of elems in current array value */
Datum *elem_values; /* array of num_elems Datums */
+
+ /* fields for skip arrays, which generate their elements procedurally */
+ bool use_sksup; /* sksup set to valid routine? */
+ bool null_elem; /* lowest/highest element actually NULL? */
+ SkipSupportData sksup; /* opclass skip scan support, when use_sksup */
+ ScanKey low_compare; /* array's > or >= lower bound */
+ ScanKey high_compare; /* array's < or <= upper bound */
+ FmgrInfo order_low; /* low_compare's ORDER proc */
+ FmgrInfo order_high; /* high_compare's ORDER proc */
} BTArrayKeyInfo;
typedef struct BTScanOpaqueData
@@ -1124,6 +1138,9 @@ typedef struct BTReadPageState
*/
#define SK_BT_REQFWD 0x00010000 /* required to continue forward scan */
#define SK_BT_REQBKWD 0x00020000 /* required to continue backward scan */
+#define SK_BT_SKIP 0x00040000 /* skip array, for skip scan */
+#define SK_BT_NEGPOSINF 0x00080000 /* no sk_argument, -inf/+inf key */
+#define SK_BT_NEXTPRIOR 0x00100000 /* sk_argument is next/prior key */
#define SK_BT_INDOPTION_SHIFT 24 /* must clear the above bits */
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
@@ -1160,6 +1177,10 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+/* GUC parameters (just a temporary convenience for reviewers) */
+extern PGDLLIMPORT int skipscan_prefix_cols;
+extern PGDLLIMPORT bool skipscan_skipsupport_enabled;
+
/*
* external entry points for btree, in nbtree.c
*/
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index f639c3a6a..2a8f6f3f1 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -21,6 +21,8 @@
amprocrighttype => 'bit', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
amprocrighttype => 'bool', amprocnum => '1', amproc => 'btboolcmp' },
+{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
+ amprocrighttype => 'bool', amprocnum => '6', amproc => 'btboolskipsupport' },
{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
amprocrighttype => 'bool', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/bpchar_ops', amproclefttype => 'bpchar',
@@ -41,12 +43,16 @@
amprocrighttype => 'char', amprocnum => '1', amproc => 'btcharcmp' },
{ amprocfamily => 'btree/char_ops', amproclefttype => 'char',
amprocrighttype => 'char', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/char_ops', amproclefttype => 'char',
+ amprocrighttype => 'char', amprocnum => '6', amproc => 'btcharskipsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '1', amproc => 'date_cmp' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '2', amproc => 'date_sortsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
+ amprocrighttype => 'date', amprocnum => '6', amproc => 'date_skipsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'timestamp', amprocnum => '1',
amproc => 'date_cmp_timestamp' },
@@ -122,6 +128,8 @@
amprocrighttype => 'int2', amprocnum => '2', amproc => 'btint2sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
+ amprocrighttype => 'int2', amprocnum => '6', amproc => 'btint2skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint24cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
@@ -141,6 +149,8 @@
amprocrighttype => 'int4', amprocnum => '2', amproc => 'btint4sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
+ amprocrighttype => 'int4', amprocnum => '6', amproc => 'btint4skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
amprocrighttype => 'int8', amprocnum => '1', amproc => 'btint48cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
@@ -160,6 +170,8 @@
amprocrighttype => 'int8', amprocnum => '2', amproc => 'btint8sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
amprocrighttype => 'int8', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
+ amprocrighttype => 'int8', amprocnum => '6', amproc => 'btint8skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint84cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
@@ -193,6 +205,8 @@
amprocrighttype => 'oid', amprocnum => '2', amproc => 'btoidsortsupport' },
{ amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
amprocrighttype => 'oid', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
+ amprocrighttype => 'oid', amprocnum => '6', amproc => 'btoidskipsupport' },
{ amprocfamily => 'btree/oidvector_ops', amproclefttype => 'oidvector',
amprocrighttype => 'oidvector', amprocnum => '1',
amproc => 'btoidvectorcmp' },
@@ -261,6 +275,8 @@
amprocrighttype => 'uuid', amprocnum => '2', amproc => 'uuid_sortsupport' },
{ amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
amprocrighttype => 'uuid', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
+ amprocrighttype => 'uuid', amprocnum => '6', amproc => 'uuid_skipsupport' },
{ amprocfamily => 'btree/record_ops', amproclefttype => 'record',
amprocrighttype => 'record', amprocnum => '1', amproc => 'btrecordcmp' },
{ amprocfamily => 'btree/record_image_ops', amproclefttype => 'record',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d36f6001b..2dec83363 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1004,18 +1004,27 @@
{ oid => '3129', descr => 'sort support',
proname => 'btint2sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint2sortsupport' },
+{ oid => '9290', descr => 'skip support',
+ proname => 'btint2skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint2skipsupport' },
{ oid => '351', descr => 'less-equal-greater',
proname => 'btint4cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'int4 int4', prosrc => 'btint4cmp' },
{ oid => '3130', descr => 'sort support',
proname => 'btint4sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint4sortsupport' },
+{ oid => '9291', descr => 'skip support',
+ proname => 'btint4skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint4skipsupport' },
{ oid => '842', descr => 'less-equal-greater',
proname => 'btint8cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'int8 int8', prosrc => 'btint8cmp' },
{ oid => '3131', descr => 'sort support',
proname => 'btint8sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint8sortsupport' },
+{ oid => '9292', descr => 'skip support',
+ proname => 'btint8skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint8skipsupport' },
{ oid => '354', descr => 'less-equal-greater',
proname => 'btfloat4cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'float4 float4', prosrc => 'btfloat4cmp' },
@@ -1034,12 +1043,18 @@
{ oid => '3134', descr => 'sort support',
proname => 'btoidsortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btoidsortsupport' },
+{ oid => '9293', descr => 'skip support',
+ proname => 'btoidskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btoidskipsupport' },
{ oid => '404', descr => 'less-equal-greater',
proname => 'btoidvectorcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'oidvector oidvector', prosrc => 'btoidvectorcmp' },
{ oid => '358', descr => 'less-equal-greater',
proname => 'btcharcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'char char', prosrc => 'btcharcmp' },
+{ oid => '9294', descr => 'skip support',
+ proname => 'btcharskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btcharskipsupport' },
{ oid => '359', descr => 'less-equal-greater',
proname => 'btnamecmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'name name', prosrc => 'btnamecmp' },
@@ -2214,6 +2229,9 @@
{ oid => '3136', descr => 'sort support',
proname => 'date_sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'date_sortsupport' },
+{ oid => '9295', descr => 'skip support',
+ proname => 'date_skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'date_skipsupport' },
{ oid => '4133', descr => 'window RANGE support',
proname => 'in_range', prorettype => 'bool',
proargtypes => 'date date interval bool bool',
@@ -4401,6 +4419,9 @@
{ oid => '1693', descr => 'less-equal-greater',
proname => 'btboolcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'bool bool', prosrc => 'btboolcmp' },
+{ oid => '9296', descr => 'skip support',
+ proname => 'btboolskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btboolskipsupport' },
{ oid => '1688', descr => 'hash',
proname => 'time_hash', prorettype => 'int4', proargtypes => 'time',
@@ -9231,6 +9252,9 @@
{ oid => '3300', descr => 'sort support',
proname => 'uuid_sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'uuid_sortsupport' },
+{ oid => '9297', descr => 'skip support',
+ proname => 'uuid_skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'uuid_skipsupport' },
{ oid => '2961', descr => 'I/O',
proname => 'uuid_recv', prorettype => 'uuid', proargtypes => 'internal',
prosrc => 'uuid_recv' },
diff --git a/src/include/utils/skipsupport.h b/src/include/utils/skipsupport.h
new file mode 100644
index 000000000..d91390fc6
--- /dev/null
+++ b/src/include/utils/skipsupport.h
@@ -0,0 +1,109 @@
+/*-------------------------------------------------------------------------
+ *
+ * skipsupport.h
+ * Support routines for B-Tree skip scan.
+ *
+ * B-Tree operator classes for discrete types can optionally provide a support
+ * function for skipping. This is used during skip scans.
+ *
+ * A B-tree operator class that implements skip support provides B-tree index
+ * scans with a way of enumerating and iterating through every possible value
+ * from the domain of indexable values. This gives scans a way to determine
+ * the next value in line for a given skip array/scan key/skipped attribute.
+ * This happens at the point where the scan determines that another primitive
+ * index scan is required. The next value is used (in combination with at
+ * least one additional lower-order non-skip key, taken from the SQL query) to
+ * relocate the scan, skipping over many irrelevant leaf pages in the process.
+ *
+ * Skip support generally works best with discrete types such as integer,
+ * date, and boolean; types where there is a decent chance that indexes will
+ * contain contiguous values (given a leading attribute using the opclass).
+ * When gaps/discontinuities are naturally rare (e.g., a leading identity
+ * column in a composite index, a date column preceding a product_id column),
+ * then it makes sense for skip scans to optimistically assume that the next
+ * distinct indexable value will find directly matching index tuples.
+ *
+ * The B-Tree code can fall back on next-key sentinel values for any opclass
+ * that doesn't provide its own skip support function. There is no point in
+ * providing skip support unless the next indexed key value is often the next
+ * indexable value (at least with some workloads). Opclasses where that never
+ * works out in practice should just rely on the B-Tree AM's generic next-key
+ * fallback strategy. Opclasses where adding skip support is infeasible or
+ * hard (e.g., an opclass for a continuous type) can also use the fallback.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/skipsupport.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SKIPSUPPORT_H
+#define SKIPSUPPORT_H
+
+#include "utils/relcache.h"
+
+typedef struct SkipSupportData *SkipSupport;
+typedef Datum (*SkipSupportIncDec) (Relation rel,
+ Datum existing,
+ bool *overflow);
+
+/*
+ * State/callbacks used by skip arrays to procedurally generate elements.
+ *
+ * A BTSKIPSUPPORT_PROC function must set each and every field when called.
+ * If an opclass can only set some of the fields, then it cannot safely
+ * provide a skip support routine.
+ */
+typedef struct SkipSupportData
+{
+ /*
+ * low_elem and high_elem must be set with the lowest and highest possible
+ * values from the domain of indexable values (assuming standard ascending
+ * order). This helps the B-Tree code with finding its initial position
+ * at the leaf level (during the skip scan's first primitive index scan).
+ * In other words, it gives the B-Tree code a useful value to start from,
+ * before any data has been read from the index.
+ *
+ * low_elem and high_elem are also used by skip scans to determine when
+ * they've reached the final possible value (in the current direction).
+ * It's typical for the scan to run out of leaf pages before it runs out
+ * of unscanned indexable values, but it's still useful for the scan to
+ * have a way to recognize when it has reached the last possible value
+ * (this saves us a useless probe that just lands on the final leaf page).
+ */
+ Datum low_elem; /* lowest sorting/leftmost non-NULL value */
+ Datum high_elem; /* highest sorting/rightmost non-NULL value */
+
+ /*
+ * Decrement/increment functions.
+ *
+ * Returns a decremented/incremented copy of caller's existing datum,
+ * allocated in caller's memory context (in the case of pass-by-reference
+ * types). It's not okay for these functions to leak any memory.
+ *
+ * Both decrement and increment callbacks are guaranteed to never be
+ * called with a NULL "existing" arg.
+ *
+ * When the decrement function (or increment function) is called with a
+ * value that already matches low_elem (or high_elem), function must set
+ * the *overflow argument. The return value is undefined, and the B-Tree
+ * code is entitled to assume that no memory will have been allocated.
+ *
+ * The B-Tree skip scan caller's "existing" datum is often just a straight
+ * copy of a value from an index tuple. Operator classes must be liberal
+ * in accepting every possible representational variation within the
+ * underlying data type. On the other hand, opclasses are _not_ expected
+ * to preserve any information that doesn't affect how datums are sorted
+ * (e.g., skip support for a fixed precision numeric type isn't required
+ * to preserve datum display scale).
+ */
+ SkipSupportIncDec decrement;
+ SkipSupportIncDec increment;
+} SkipSupportData;
+
+extern bool PrepareSkipSupportFromOpclass(Oid opfamily, Oid opcintype,
+ bool reverse, SkipSupport sksup);
+
+#endif /* SKIPSUPPORT_H */
diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c
index 1c72867c8..deb387453 100644
--- a/src/backend/access/nbtree/nbtcompare.c
+++ b/src/backend/access/nbtree/nbtcompare.c
@@ -58,6 +58,7 @@
#include <limits.h>
#include "utils/fmgrprotos.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
#ifdef STRESS_SORT_INT_MIN
@@ -78,6 +79,49 @@ btboolcmp(PG_FUNCTION_ARGS)
PG_RETURN_INT32((int32) a - (int32) b);
}
+static Datum
+bool_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ bool bexisting = DatumGetBool(existing);
+
+ if (bexisting == false)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return BoolGetDatum(bexisting - 1);
+}
+
+static Datum
+bool_increment(Relation rel, Datum existing, bool *overflow)
+{
+ bool bexisting = DatumGetBool(existing);
+
+ if (bexisting == true)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return BoolGetDatum(bexisting + 1);
+}
+
+Datum
+btboolskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = bool_decrement;
+ sksup->increment = bool_increment;
+ sksup->low_elem = BoolGetDatum(false);
+ sksup->high_elem = BoolGetDatum(true);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint2cmp(PG_FUNCTION_ARGS)
{
@@ -105,6 +149,49 @@ btint2sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int2_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ int16 iexisting = DatumGetInt16(existing);
+
+ if (iexisting == PG_INT16_MIN)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return Int16GetDatum(iexisting - 1);
+}
+
+static Datum
+int2_increment(Relation rel, Datum existing, bool *overflow)
+{
+ int16 iexisting = DatumGetInt16(existing);
+
+ if (iexisting == PG_INT16_MAX)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return Int16GetDatum(iexisting + 1);
+}
+
+Datum
+btint2skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int2_decrement;
+ sksup->increment = int2_increment;
+ sksup->low_elem = Int16GetDatum(PG_INT16_MIN);
+ sksup->high_elem = Int16GetDatum(PG_INT16_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint4cmp(PG_FUNCTION_ARGS)
{
@@ -128,6 +215,49 @@ btint4sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int4_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ int32 iexisting = DatumGetInt32(existing);
+
+ if (iexisting == PG_INT32_MIN)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return Int32GetDatum(iexisting - 1);
+}
+
+static Datum
+int4_increment(Relation rel, Datum existing, bool *overflow)
+{
+ int32 iexisting = DatumGetInt32(existing);
+
+ if (iexisting == PG_INT32_MAX)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return Int32GetDatum(iexisting + 1);
+}
+
+Datum
+btint4skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int4_decrement;
+ sksup->increment = int4_increment;
+ sksup->low_elem = Int32GetDatum(PG_INT32_MIN);
+ sksup->high_elem = Int32GetDatum(PG_INT32_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint8cmp(PG_FUNCTION_ARGS)
{
@@ -171,6 +301,49 @@ btint8sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int8_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ int64 iexisting = DatumGetInt64(existing);
+
+ if (iexisting == PG_INT64_MIN)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return Int64GetDatum(iexisting - 1);
+}
+
+static Datum
+int8_increment(Relation rel, Datum existing, bool *overflow)
+{
+ int64 iexisting = DatumGetInt64(existing);
+
+ if (iexisting == PG_INT64_MAX)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return Int64GetDatum(iexisting + 1);
+}
+
+Datum
+btint8skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int8_decrement;
+ sksup->increment = int8_increment;
+ sksup->low_elem = Int64GetDatum(PG_INT64_MIN);
+ sksup->high_elem = Int64GetDatum(PG_INT64_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint48cmp(PG_FUNCTION_ARGS)
{
@@ -292,6 +465,49 @@ btoidsortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+oid_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ Oid oexisting = DatumGetObjectId(existing);
+
+ if (oexisting == InvalidOid)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return ObjectIdGetDatum(oexisting - 1);
+}
+
+static Datum
+oid_increment(Relation rel, Datum existing, bool *overflow)
+{
+ Oid oexisting = DatumGetObjectId(existing);
+
+ if (oexisting == OID_MAX)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return ObjectIdGetDatum(oexisting + 1);
+}
+
+Datum
+btoidskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = oid_decrement;
+ sksup->increment = oid_increment;
+ sksup->low_elem = ObjectIdGetDatum(InvalidOid);
+ sksup->high_elem = ObjectIdGetDatum(OID_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btoidvectorcmp(PG_FUNCTION_ARGS)
{
@@ -325,3 +541,48 @@ btcharcmp(PG_FUNCTION_ARGS)
/* Be careful to compare chars as unsigned */
PG_RETURN_INT32((int32) ((uint8) a) - (int32) ((uint8) b));
}
+
+static Datum
+char_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ uint8 cexisting = DatumGetUInt8(existing);
+
+ if (cexisting == 0)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return CharGetDatum((uint8) cexisting - 1);
+}
+
+static Datum
+char_increment(Relation rel, Datum existing, bool *overflow)
+{
+ uint8 cexisting = DatumGetUInt8(existing);
+
+ if (cexisting == UCHAR_MAX)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return CharGetDatum((uint8) cexisting + 1);
+}
+
+Datum
+btcharskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = char_decrement;
+ sksup->increment = char_increment;
+
+ /* btcharcmp compares chars as unsigned */
+ sksup->low_elem = UInt8GetDatum(0);
+ sksup->high_elem = UInt8GetDatum(UCHAR_MAX);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 88f4ef7b7..05b98efc2 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -880,7 +880,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Buffer buf;
BTStack stack;
OffsetNumber offnum;
- StrategyNumber strat;
BTScanInsertData inskey;
ScanKey startKeys[INDEX_MAX_KEYS];
ScanKeyData notnullkeys[INDEX_MAX_KEYS];
@@ -1022,6 +1021,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
ScanKey chosen;
ScanKey impliesNN;
ScanKey cur;
+ int ikey = 0,
+ ichosen = 0;
/*
* chosen is the so-far-chosen key for the current attribute, if any.
@@ -1042,6 +1043,53 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
{
if (i >= so->numberOfKeys || cur->sk_attno != curattr)
{
+ /*
+ * Conceptually, skip arrays consist of array elements whose
+ * values are generated procedurally and on demand. We need
+ * special handling for that here.
+ *
+ * We must interpret various sentinel values to generate an
+ * insertion scan key. This is only actually needed for index
+ * attributes whose input opclass lacks a skip support routine
+ * (when skip support is available we'll always be able to
+ * generate true array element datum values instead).
+ */
+ if (chosen && (chosen->sk_flags & SK_BT_NEGPOSINF))
+ {
+ ScanKey origchosen = chosen;
+ BTArrayKeyInfo *array = NULL;
+
+ for (; ikey < so->numArrayKeys; ikey++)
+ {
+ array = &so->arrayKeys[ikey];
+ if (array->scan_key == ichosen)
+ break;
+ }
+
+ /* use array's inequality key in startKeys[] */
+ if (ScanDirectionIsForward(dir))
+ chosen = array->low_compare;
+ else
+ chosen = array->high_compare;
+
+ if (!chosen && !array->null_elem)
+ {
+ /*
+ * Array doesn't have any explicit low_compare or
+ * high_compare that we can use (given the current
+ * scan direction). The array does not include a NULL
+ * element (to generate an IS NULL qual), though, so
+ * we might need to deduce a NOT NULL key to skip over
+ * any NULLs. Prepare for that.
+ *
+ * Note: this is also how we handle an explicit NOT
+ * NULL key that preprocessing folded into the skip
+ * array.
+ */
+ impliesNN = origchosen;
+ }
+ }
+
/*
* Done looking at keys for curattr. If we didn't find a
* usable boundary key, see if we can deduce a NOT NULL key.
@@ -1075,16 +1123,34 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
break;
startKeys[keysz++] = chosen;
+ /*
+ * Skip arrays can also use a sk_argument which is marked
+ * "next key". This is another sentinel array element value
+ * requiring special handling here by us. As with -inf/+inf
+ * sentinels, there cannot be any exact non-pivot matches.
+ */
+ if (chosen->sk_flags & SK_BT_NEXTPRIOR)
+ {
+ /*
+ * Adjust strat_total, so that our = key gets treated like
+ * a > key (or like a < key)
+ */
+ if (ScanDirectionIsForward(dir))
+ strat_total = BTGreaterStrategyNumber;
+ else
+ strat_total = BTLessStrategyNumber;
+ break;
+ }
+
/*
* Adjust strat_total, and quit if we have stored a > or <
* key.
*/
- strat = chosen->sk_strategy;
- if (strat != BTEqualStrategyNumber)
+ if (chosen->sk_strategy != BTEqualStrategyNumber)
{
- strat_total = strat;
- if (strat == BTGreaterStrategyNumber ||
- strat == BTLessStrategyNumber)
+ strat_total = chosen->sk_strategy;
+ if (chosen->sk_strategy == BTGreaterStrategyNumber ||
+ chosen->sk_strategy == BTLessStrategyNumber)
break;
}
@@ -1103,6 +1169,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
curattr = cur->sk_attno;
chosen = NULL;
impliesNN = NULL;
+ ichosen = -1;
}
/*
@@ -1127,6 +1194,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
case BTEqualStrategyNumber:
/* override any non-equality choice */
chosen = cur;
+ ichosen = i;
break;
case BTGreaterEqualStrategyNumber:
case BTGreaterStrategyNumber:
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7fa977a62..fa046c550 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -29,9 +29,37 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+/*
+ * GUC parameters (temporary convenience for reviewers).
+ *
+ * To disable all skipping, set skipscan_prefix_cols=0. Otherwise set it to
+ * the attribute number that you wish to make the last attribute number that
+ * we can add a skip scan key for. For example, skipscan_prefix_cols=1 makes
+ * an index scan with qual "WHERE b = 1 AND c > 42" generate a skip scan key
+ * on the column 'a' (which is attnum 1) only, preventing us from adding one
+ * for the column 'c' (and so 'c' will still have an inequality scan key,
+ * required in only one direction -- 'c' won't be output as a "range" skip
+ * key/array).
+ */
+int skipscan_prefix_cols = INDEX_MAX_KEYS;
+
+/*
+ * skipscan_skipsupport_enabled can be used to avoid using opclass skip
+ * support routines. This can be used to quantify the performance benefit that
+ * comes from having dedicated skip support, with a given test query.
+ */
+bool skipscan_skipsupport_enabled = true;
+
#define LOOK_AHEAD_REQUIRED_RECHECKS 3
#define LOOK_AHEAD_DEFAULT_DISTANCE 5
+typedef struct BTSkipPreproc
+{
+ SkipSupportData sksup; /* opclass skip scan support (optional) */
+ bool use_sksup; /* sksup set to valid routine? */
+ Oid eq_op; /* InvalidOid means don't skip */
+} BTSkipPreproc;
+
typedef struct BTSortArrayContext
{
FmgrInfo *sortproc;
@@ -64,15 +92,38 @@ static bool _bt_compare_array_scankey_args(IndexScanDesc scan,
bool *qual_ok);
static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys);
static void _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap);
+static int _bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts);
+static bool _bt_skipsupport(Relation rel, int add_skip_attno,
+ BTSkipPreproc *skipatts);
static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
static inline int32 _bt_compare_array_skey(FmgrInfo *orderproc,
Datum tupdatum, bool tupnull,
- Datum arrdatum, ScanKey cur);
+ Datum arrdatum, bool arrnull,
+ ScanKey cur);
+static void _bt_array_preproc_shrink(ScanKey arraysk, ScanKey skey,
+ FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok);
+static bool _bt_skip_preproc_shrink(IndexScanDesc scan, ScanKey arraysk,
+ ScanKey skey, FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok);
static int _bt_binsrch_array_skey(FmgrInfo *orderproc,
bool cur_elem_trig, ScanDirection dir,
Datum tupdatum, bool tupnull,
BTArrayKeyInfo *array, ScanKey cur,
int32 *set_elem_result);
+static void _bt_binsrch_skiparray_skey(FmgrInfo *orderproc,
+ bool cur_elem_trig, ScanDirection dir,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *set_elem_result);
+static void _bt_scankey_set_low_or_high(Relation rel, ScanKey skey,
+ BTArrayKeyInfo *array, bool low_not_high);
+static void _bt_scankey_set_element(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ Datum tupdatum, bool tupnull);
+static void _bt_scankey_unset_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static void _bt_scankey_set_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static bool _bt_scankey_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static bool _bt_scankey_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir);
static void _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir);
static bool _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
@@ -258,11 +309,19 @@ _bt_freestack(BTStack stack)
* preprocessing steps are complete. This will convert the scan key offset
* references into references to the scan's so->keyData[] output scan keys.
*
+ * We're also responsible for generating skip arrays (and their associated
+ * scan keys) here. This enables skip scan. We do this for index attributes
+ * that initially lacked an equality condition within scan->keyData[], iff
+ * doing so allows a later scan key (that was passed to us in scan->keyData[])
+ * to be marked required by later preprocessing on output.
+ * _bt_decide_skipatts decides which attributes receive skip arrays.
+ *
* Caller must pass *numberOfKeys to give us a way to change the number of
* input scan keys (our output is caller's input). The returned array can be
* smaller than scan->keyData[] when we eliminated a redundant array scan key
- * (redundant with some other array scan key, for the same attribute). Caller
- * uses this to allocate so->keyData[] for the current btrescan.
+ * (redundant with some other array scan key, for the same attribute). It can
+ * also be larger when we added a skip array/skip scan key. Caller uses this
+ * to allocate so->keyData[] for the current btrescan.
*
* Note: the reason we need to return a temp scan key array, rather than just
* scribbling on scan->keyData, is that callers are permitted to call btrescan
@@ -275,8 +334,11 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
Relation rel = scan->indexRelation;
int numArrayKeyData = scan->numberOfKeys;
int16 *indoption = rel->rd_indoption;
+ BTSkipPreproc skipatts[INDEX_MAX_KEYS];
int numArrayKeys,
+ numSkipArrayKeys,
output_ikey = 0;
+ AttrNumber attno_skip = 1;
int origarrayatt = InvalidAttrNumber,
origarraykey = -1;
Oid origelemtype = InvalidOid;
@@ -286,7 +348,10 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
Assert(scan->numberOfKeys);
- /* Quick check to see if there are any array keys */
+ /*
+ * Quick check to see if there are any array keys, or any missing keys we
+ * can generate a "skip scan" array key for ourselves
+ */
numArrayKeys = 0;
for (int i = 0; i < scan->numberOfKeys; i++)
{
@@ -304,6 +369,16 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
}
}
+ /* Consider generating skip arrays, and associated equality scan keys */
+ numSkipArrayKeys = _bt_decide_skipatts(scan, skipatts);
+ if (numSkipArrayKeys)
+ {
+ /* At least one skip array scan key must be added to arrayKeyData[] */
+ numArrayKeys += numSkipArrayKeys;
+ /* output scan key buffer allocation needs space for skip scan keys */
+ numArrayKeyData += numSkipArrayKeys;
+ }
+
/* Quit if nothing to do. */
if (numArrayKeys == 0)
return NULL;
@@ -330,7 +405,12 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
/* Allocate space for ORDER procs used to help _bt_checkkeys */
so->orderProcs = (FmgrInfo *) palloc(numArrayKeyData * sizeof(FmgrInfo));
- /* Now process each array key */
+ /*
+ * Process each array key, and generate skip arrays as needed. Also copy
+ * every scan->keyData[] input scan key (whether it's an array or not)
+ * into the arrayKeyData array we'll return to our caller (barring any
+ * array scan keys that we could eliminate early through array merging).
+ */
numArrayKeys = 0;
for (int input_ikey = 0; input_ikey < scan->numberOfKeys; input_ikey++)
{
@@ -348,6 +428,73 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
int num_nonnulls;
int j;
+ /* Create a skip array and scan key where indicated by skipatts */
+ while (numSkipArrayKeys &&
+ attno_skip <= scan->keyData[input_ikey].sk_attno)
+ {
+ Oid opcintype = rel->rd_opcintype[attno_skip - 1];
+ Oid collation = rel->rd_indcollation[attno_skip - 1];
+ Oid eq_op = skipatts[attno_skip - 1].eq_op;
+ RegProcedure cmp_proc;
+
+ if (!OidIsValid(eq_op))
+ {
+ /* won't skip using this attribute */
+ attno_skip++;
+ continue;
+ }
+
+ cmp_proc = get_opcode(eq_op);
+ if (!RegProcedureIsValid(cmp_proc))
+ elog(ERROR, "missing oprcode for skipping equals operator %u", eq_op);
+
+ cur = &arrayKeyData[output_ikey];
+ Assert(attno_skip <= scan->keyData[input_ikey].sk_attno);
+ ScanKeyEntryInitialize(cur,
+ SK_SEARCHARRAY | SK_BT_SKIP, /* flags */
+ attno_skip, /* skipped att number */
+ BTEqualStrategyNumber, /* equality strategy */
+ InvalidOid, /* opclass input subtype */
+ collation, /* index column's collation */
+ cmp_proc, /* equality operator's proc */
+ (Datum) 0); /* constant */
+
+ /* Initialize array fields */
+ so->arrayKeys[numArrayKeys].scan_key = output_ikey;
+ so->arrayKeys[numArrayKeys].num_elems = -1;
+ so->arrayKeys[numArrayKeys].cur_elem = 0;
+ so->arrayKeys[numArrayKeys].elem_values = NULL; /* unused */
+ so->arrayKeys[numArrayKeys].use_sksup = skipatts[attno_skip - 1].use_sksup;
+ so->arrayKeys[numArrayKeys].null_elem = true; /* for now */
+ so->arrayKeys[numArrayKeys].sksup = skipatts[attno_skip - 1].sksup;
+ so->arrayKeys[numArrayKeys].low_compare = NULL; /* for now */
+ so->arrayKeys[numArrayKeys].high_compare = NULL; /* for now */
+
+ /*
+ * Temporary testing GUC can disable the use of an opclass's skip
+ * support routine
+ */
+ if (!skipscan_skipsupport_enabled)
+ so->arrayKeys[numArrayKeys].use_sksup = false;
+
+ /*
+ * We'll need a 3-way ORDER proc to determine when and how the
+ * consed-up "array" will advance inside _bt_advance_array_keys.
+ * Set one up now.
+ */
+ _bt_setup_array_cmp(scan, cur, opcintype,
+ &so->orderProcs[output_ikey], NULL);
+
+ /*
+ * Prepare to output next scan key (might be another skip scan
+ * key, or it could be an input scan key from scan->keyData[])
+ */
+ numSkipArrayKeys--;
+ numArrayKeys++;
+ attno_skip++;
+ output_ikey++; /* keep this scan key/array */
+ }
+
/*
* Copy input scan key into temp arrayKeyData scan key array. (From
* here on, cur points at our copy of the input scan key.)
@@ -522,6 +669,10 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
so->arrayKeys[numArrayKeys].scan_key = output_ikey;
so->arrayKeys[numArrayKeys].num_elems = num_elems;
so->arrayKeys[numArrayKeys].elem_values = elem_values;
+ so->arrayKeys[numArrayKeys].null_elem = false; /* unused */
+ so->arrayKeys[numArrayKeys].use_sksup = false; /* redundant */
+ so->arrayKeys[numArrayKeys].low_compare = NULL; /* unused */
+ so->arrayKeys[numArrayKeys].high_compare = NULL; /* unused */
numArrayKeys++;
output_ikey++; /* keep this scan key/array */
}
@@ -635,7 +786,8 @@ _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap)
{
BTArrayKeyInfo *array = &so->arrayKeys[arrayidx];
- Assert(array->num_elems > 0);
+ Assert(array->num_elems > 0 || array->num_elems == -1);
+ Assert(array->num_elems != -1 || outkey->sk_flags & SK_BT_REQFWD);
if (array->scan_key == input_ikey)
{
@@ -696,6 +848,211 @@ _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap)
so->numArrayKeys, INDEX_MAX_KEYS)));
}
+/*
+ * _bt_decide_skipatts() -- set index attributes requiring skip arrays
+ *
+ * _bt_preprocess_array_keys helper function. Determines which attributes
+ * will require skip arrays/scan keys. Also sets up skip support callbacks
+ * for attributes whose input opclass has skip support (opclasses without
+ * skip support will fall back on using next-key sentinel values when
+ * advancing the skip array to its next array element).
+ *
+ * Return value is the total number of scan keys to add as "input" scan keys
+ * for further processing within _bt_preprocess_keys.
+ */
+static int
+_bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts)
+{
+ Relation rel = scan->indexRelation;
+ ScanKey inputsk;
+ AttrNumber attno_inputsk = 1,
+ attno_skip = 1;
+ bool attno_has_equal = false,
+ attno_has_rowcompare = false;
+ int numSkipArrayKeys = 0,
+ prev_numSkipArrayKeys = 0;
+
+ Assert(scan->numberOfKeys);
+
+ /*
+ * FIXME Don't support parallel index scans for now.
+ *
+ * _bt_parallel_primscan_schedule must be taught to account for skip
+ * arrays. This is likely to require that we store the current array
+ * element datum in shared memory.
+ */
+ if (scan->parallel_scan)
+ return 0;
+
+ /*
+ * Only add skip arrays (and associated scan keys) when doing so will
+ * enable _bt_preprocess_keys to mark one or more lower-order input scan
+ * keys (user-visible scan keys taken from scan->keyData[] input array) as
+ * required to continue the scan.
+ */
+ inputsk = &scan->keyData[0];
+ for (int i = 0;; inputsk++, i++)
+ {
+ /*
+ * Backfill skip arrays for any wholly omitted attributes prior to
+ * attno_inputsk
+ */
+ while (attno_skip < attno_inputsk)
+ {
+ if (!_bt_skipsupport(rel, attno_skip, &skipatts[attno_skip - 1]))
+ {
+ /*
+ * Opclass lacks a suitable skip support routine.
+ *
+ * Return prev_numSkipArrayKeys, so as to avoid including any
+ * "backfilled" arrays that were supposed to form a contiguous
+ * group with a skip array on this attribute. There is no
+ * benefit to adding backfill skip arrays unless we can do so
+ * for all attributes (all attributes up to and including the
+ * one immediately before attno_inputsk).
+ */
+ return prev_numSkipArrayKeys;
+ }
+
+ /* plan on adding a backfill skip array for this attribute */
+ numSkipArrayKeys++;
+ attno_skip++;
+ }
+
+ /*
+ * Stop once past the final input scan key. We deliberately never add
+ * a skip attribute for the attribute of the last input scan key.
+ *
+ * If the last input scan key(s) use equality strategy, then a skip
+ * attribute is superfluous at best. If the last input scan key uses
+ * an inequality strategy, then adding a skip scan array/scan key is a
+ * valid though suboptimal transformation. It is better to arrange
+ * for preprocessing to allow such an input inequality scan key to
+ * remain an inequality on output. That way _bt_checkkeys will be
+ * able to make best use of both of its precheck optimizations, but
+ * _bt_first will be no less capable of efficiently finding the
+ * starting position for each primitive index scan.
+ */
+ if (i >= scan->numberOfKeys)
+ break;
+
+ /*
+ * Cannot keep adding skip arrays after a RowCompare
+ */
+ if (attno_has_rowcompare)
+ break;
+
+ /*
+ * Apply temporary testing GUC that can be used to disable skipping
+ * (either in part or in whole)
+ */
+ if (attno_inputsk > skipscan_prefix_cols)
+ break;
+
+ /*
+ * Now consider next attno_inputsk (or keep going if this is an
+ * additional scan key against the same attribute)
+ */
+ if (attno_inputsk < inputsk->sk_attno)
+ {
+ prev_numSkipArrayKeys = numSkipArrayKeys;
+
+ /*
+ * Now add skip array for previous scan key's attribute, though
+ * only if the attribute has no equality strategy scan keys.
+ *
+ * Adding skip arrays to an attribute that has one or more
+ * inequality scan keys will cause preprocessing to output a range
+ * skip array. This will happen when preprocessing proper deals
+ * with the redundancy between the array and its inequalities.
+ */
+ skipatts[attno_skip - 1].eq_op = InvalidOid;
+ if (!attno_has_equal)
+ {
+ /* Only saw inequalities for the prior attribute */
+ if (_bt_skipsupport(rel, attno_skip, &skipatts[attno_skip - 1]))
+ {
+ /* add a range skip array for this attribute */
+ numSkipArrayKeys++;
+ }
+ else
+ break;
+ }
+ else
+ {
+ /*
+ * Saw an equality for the prior attribute, so it doesn't need
+ * a skip array (not even a range skip array)
+ */
+ }
+
+ /* Set things up for this new attribute */
+ attno_skip++;
+ attno_inputsk = inputsk->sk_attno;
+ attno_has_equal = false;
+ }
+
+ /*
+ * Track if this scan key's attribute has any equality strategy scan
+ * keys.
+ *
+ * Treat IS NULL scan keys as using equal strategy (they'll be marked
+ * as using it later on, by _bt_fix_scankey_strategy).
+ */
+ if (inputsk->sk_strategy == BTEqualStrategyNumber ||
+ (inputsk->sk_flags & SK_SEARCHNULL))
+ attno_has_equal = true;
+
+ /*
+ * We don't support RowCompare transformation. Remember that we saw a
+ * RowCompare, so that we don't keep adding skip attributes.
+ *
+ * We do still backfill skip attributes before the RowCompare, so that
+ * it can be marked required. This is similar to what happens when a
+ * conventional inequality uses an opclass that lacks skip support.
+ */
+ if (inputsk->sk_flags & SK_ROW_HEADER)
+ attno_has_rowcompare = true;
+ }
+
+ return numSkipArrayKeys;
+}
+
+/*
+ * _bt_skipsupport() -- set up skip support function in *skipatts
+ *
+ * Returns true on success, indicating that we set *skipatts with input
+ * opclass's equality operator. Otherwise returns false.
+ */
+static bool
+_bt_skipsupport(Relation rel, int add_skip_attno, BTSkipPreproc *skipatts)
+{
+ int16 *indoption = rel->rd_indoption;
+ Oid opfamily = rel->rd_opfamily[add_skip_attno - 1];
+ Oid opcintype = rel->rd_opcintype[add_skip_attno - 1];
+ bool reverse;
+
+ /* Look up input opclass's equality operator (might fail) */
+ skipatts->eq_op = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ /*
+ * We don't really expect input opclasses lacking even an equality
+ * operator, but they're still supported. Deal with them gracefully.
+ */
+ if (!OidIsValid(skipatts->eq_op))
+ return false;
+
+ /* Have skip support infrastructure set all SkipSupport fields */
+ reverse = (indoption[add_skip_attno - 1] & INDOPTION_DESC) != 0;
+ skipatts->use_sksup = PrepareSkipSupportFromOpclass(opfamily, opcintype,
+ reverse,
+ &skipatts->sksup);
+
+ /* might not have set up skip support routine, but can skip either way */
+ return true;
+}
+
/*
* _bt_setup_array_cmp() -- Set up array comparison functions
*
@@ -988,17 +1345,15 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
FmgrInfo *orderproc, BTArrayKeyInfo *array,
bool *qual_ok)
{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
Oid opcintype = rel->rd_opcintype[arraysk->sk_attno - 1];
- int cmpresult = 0,
- cmpexact = 0,
- matchelem,
- new_nelems = 0;
FmgrInfo crosstypeproc;
FmgrInfo *orderprocp = orderproc;
+ MemoryContext oldContext;
+ bool eliminated;
Assert(arraysk->sk_attno == skey->sk_attno);
- Assert(array->num_elems > 0);
Assert(!(arraysk->sk_flags & (SK_ISNULL | SK_ROW_HEADER | SK_ROW_MEMBER)));
Assert((arraysk->sk_flags & SK_SEARCHARRAY) &&
arraysk->sk_strategy == BTEqualStrategyNumber);
@@ -1011,8 +1366,8 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
* datum of opclass input type for the index's attribute (on-disk type).
* We can reuse the array's ORDER proc whenever the non-array scan key's
* type is a match for the corresponding attribute's input opclass type.
- * Otherwise, we have to do another ORDER proc lookup so that our call to
- * _bt_binsrch_array_skey applies the correct comparator.
+ * Otherwise, we have to do another ORDER proc lookup. We have to be sure
+ * that _bt_compare_array_skey/_bt_binsrch_array_skey use the right proc.
*
* Note: we have to support the convention that sk_subtype == InvalidOid
* means the opclass input type; this is a hack to simplify life for
@@ -1043,11 +1398,65 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
return false;
}
- /* We have all we need to determine redundancy/contradictoriness */
+ /* We successfully looked up the required cross-type ORDER proc */
orderprocp = &crosstypeproc;
fmgr_info(cmp_proc, orderprocp);
}
+ oldContext = MemoryContextSwitchTo(so->arrayContext);
+
+ /*
+ * Perform preprocessing of the array based on whether it's a conventional
+ * array, or a skip array. Sets *qual_ok correctly in passing.
+ */
+ if (array->num_elems != -1)
+ {
+ _bt_array_preproc_shrink(arraysk, skey, orderprocp, array, qual_ok);
+
+ /*
+ * We successfully looked up the required cross-type ORDER proc, which
+ * ensured that the scalar scan key could be eliminated as redundant
+ */
+ eliminated = true;
+ }
+ else
+ {
+ /*
+ * With a skip array it's possible that we won't be able to eliminate
+ * the scalar scan key, despite looking up the required ORDER proc.
+ * This happens when earlier preprocessing wasn't able to eliminate a
+ * redundant scan key inequality due to a lack of cross-type support.
+ */
+ eliminated = _bt_skip_preproc_shrink(scan, arraysk, skey, orderprocp,
+ array, qual_ok);
+ }
+
+ MemoryContextSwitchTo(oldContext);
+
+ return eliminated;
+}
+
+/*
+ * Finish off preprocessing of conventional (non-skip) array scan key when it
+ * is redundant with (or contradicted by) a non-array scalar scan key.
+ * _bt_compare_array_scankey_args helper function, called after the relevant
+ * (potentially cross-type) ORDER proc has been looked up successfully.
+ *
+ * Rewrites caller's array in-place as needed to eliminate redundant array
+ * elements. Calling here always renders caller's scalar scan key redundant.
+ */
+static void
+_bt_array_preproc_shrink(ScanKey arraysk, ScanKey skey, FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok)
+{
+ int cmpresult = 0,
+ cmpexact = 0,
+ matchelem,
+ new_nelems = 0;
+
+ Assert(array->num_elems > 0);
+ Assert(!(arraysk->sk_flags & SK_BT_SKIP));
+
matchelem = _bt_binsrch_array_skey(orderprocp, false,
NoMovementScanDirection,
skey->sk_argument, false, array,
@@ -1099,6 +1508,137 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
array->num_elems = new_nelems;
*qual_ok = new_nelems > 0;
+}
+
+/*
+ * Finish off preprocessing of skip array scan key when it is "redundant with"
+ * a non-array scalar scan key. The scalar scan key must be an inequality.
+ * _bt_compare_array_scankey_args helper function, called after the relevant
+ * (potentially cross-type) ORDER proc has been looked up successfully.
+ *
+ * Unlike _bt_array_preproc_shrink, we cannot really modify caller's array
+ * in-place. Skip arrays work by procedurally generating their elements as
+ * needed, so our approach is to store a copy of the inequality in the skip
+ * array, allowing its elements to be generated within the limits of a range.
+ * Calling here usually renders caller's scalar scan key redundant (the key is
+ * applied when the array advances, but that's just an implementation detail).
+ *
+ * Return value indicates whether caller's scalar scan key was successfully
+ * rolled into the skip array (as its lower or upper bound). We return true
+ * in the common case where that works out. We return false when the array
+ * already has a bound of the same kind, and we cannot determine which of
+ * the two inequalities is redundant (due to a lack of cross-type support).
+ */
+static bool
+_bt_skip_preproc_shrink(IndexScanDesc scan, ScanKey arraysk, ScanKey skey,
+ FmgrInfo *orderprocp, BTArrayKeyInfo *array,
+ bool *qual_ok)
+{
+ bool test_result;
+
+ /*
+ * We don't expect to have to deal with NULLs in non-array/non-skip scan
+ * key. We expect _bt_preprocess_array_keys to avoid generating a skip
+ * array for an index attribute with an IS NULL input scan key. (It will
+ * still do so in the presence of IS NOT NULL input scan keys, but
+ * _bt_compare_scankey_args is expected to handle those for us.)
+ */
+ Assert(arraysk->sk_flags & SK_BT_SKIP);
+ Assert(arraysk->sk_flags & SK_SEARCHARRAY);
+ Assert(arraysk->sk_strategy == BTEqualStrategyNumber);
+ Assert(array->num_elems == -1);
+
+ /* Scalar scan key must be a B-Tree inequality (these are always strict) */
+ Assert(!(skey->sk_flags & SK_ISNULL));
+ Assert(skey->sk_strategy != BTEqualStrategyNumber);
+
+ /*
+ * Array must not generate a NULL array element (for "IS NULL" qual). Its
+ * index attribute is constrained by a strict operator, so NULL elements
+ * must not be returned by the scan (it would be wrong to allow it).
+ */
+ array->null_elem = false;
+ *qual_ok = true;
+
+ /*
+ * Store a copy of caller's scalar scan key, plus a copy of the operator's
+ * corresponding 3-way ORDER proc.
+ *
+ * A skip array scan key always uses the underlying index attribute's
+ * input opclass, but it's possible that caller's scalar scan key uses a
+ * cross-type operator. In cross-type scenarios, skey.sk_argument doesn't
+ * use the same type as later array elements (which are all just copies of
+ * datums taken from index tuples, possibly modified by skip support).
+ *
+ * We represent the lowest (and highest) possible value in the array using
+ * the sentinel value -inf (+inf for high_compare). The only exceptions
+ * apply when the opclass has skip support: there we can use a copy of the
+ * skip support routine's low_elem/high_elem instead -- though only when
+ * there is no corresponding low_compare/high_compare inequality.
+ *
+ * _bt_first understands that -inf/+inf indicate that it should use the
+ * low_compare/high_compare inequality for initial positioning purposes
+ * when it sees either value (unless there is no corresponding inequality,
+ * in which case the values are literally interpreted as -inf or +inf).
+ * _bt_first can therefore vary in whether it uses a cross-type operator,
+ * or an input-opclass-only operator (it can vary across primitive scans
+ * for the same index attribute/skip array).
+ *
+ * _bt_scankey_decrement/_bt_scankey_increment both make sure that each
+ * newly generated element is constrained by low_compare/high_compare.
+ * This must happen without skey.sk_argument ever being treated as a true
+ * array element (that wouldn't always work because array elements are
+ * only ever supposed to use the opclass input type).
+ */
+ switch (skey->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+ case BTLessEqualStrategyNumber:
+ if (array->high_compare)
+ {
+ /* try to keep only one high_compare inequality */
+ if (!_bt_compare_scankey_args(scan, array->high_compare, skey,
+ array->high_compare, NULL, NULL,
+ &test_result))
+ return false; /* can't make new high_compare redundant */
+
+ if (!test_result)
+ return true; /* discard new high_compare */
+
+ /* replace old high_compare with new one */
+ }
+ else
+ array->high_compare = palloc(sizeof(ScanKeyData));
+
+ memcpy(array->high_compare, skey, sizeof(ScanKeyData));
+ array->order_high = *orderprocp;
+ break;
+ case BTGreaterEqualStrategyNumber:
+ case BTGreaterStrategyNumber:
+ if (array->low_compare)
+ {
+ /* try to keep only one low_compare inequality */
+ if (!_bt_compare_scankey_args(scan, array->low_compare, skey,
+ array->low_compare, NULL, NULL,
+ &test_result))
+ return false; /* can't make new low_compare redundant */
+
+ if (!test_result)
+ return true; /* discard new low_compare */
+
+ /* replace old low_compare with new one */
+ }
+ else
+ array->low_compare = palloc(sizeof(ScanKeyData));
+
+ memcpy(array->low_compare, skey, sizeof(ScanKeyData));
+ array->order_low = *orderprocp;
+ break;
+ default:
+ elog(ERROR, "unrecognized StrategyNumber: %d",
+ (int) skey->sk_strategy);
+ break;
+ }
return true;
}
@@ -1141,7 +1681,8 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
static inline int32
_bt_compare_array_skey(FmgrInfo *orderproc,
Datum tupdatum, bool tupnull,
- Datum arrdatum, ScanKey cur)
+ Datum arrdatum, bool arrnull,
+ ScanKey cur)
{
int32 result = 0;
@@ -1149,14 +1690,14 @@ _bt_compare_array_skey(FmgrInfo *orderproc,
if (tupnull) /* NULL tupdatum */
{
- if (cur->sk_flags & SK_ISNULL)
+ if (arrnull)
result = 0; /* NULL "=" NULL */
else if (cur->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (cur->sk_flags & SK_ISNULL) /* NOT_NULL tupdatum, NULL arrdatum */
+ else if (arrnull) /* NOT_NULL tupdatum, NULL arrdatum */
{
if (cur->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
@@ -1222,6 +1763,8 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
Datum arrdatum;
Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(!(cur->sk_flags & SK_BT_SKIP));
+ Assert(!(cur->sk_flags & SK_ISNULL)); /* plain arrays can't do this */
Assert(cur->sk_strategy == BTEqualStrategyNumber);
if (cur_elem_trig)
@@ -1257,7 +1800,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
{
arrdatum = array->elem_values[low_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result <= 0)
{
@@ -1285,7 +1828,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
{
arrdatum = array->elem_values[high_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result >= 0)
{
@@ -1312,7 +1855,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
arrdatum = array->elem_values[mid_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result == 0)
{
@@ -1337,13 +1880,102 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
*/
if (low_elem != mid_elem)
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- array->elem_values[low_elem], cur);
+ array->elem_values[low_elem], false,
+ cur);
*set_elem_result = result;
return low_elem;
}
+/*
+ * _bt_binsrch_skiparray_skey() -- "Binary search" within a skip array
+ *
+ * This routine doesn't return an index into the array, because the array
+ * doesn't actually have any elements (it generates its array elements
+ * procedurally instead). Note that this may include a NULL value/an IS NULL
+ * qual.
+ *
+ * Sets *set_elem_result just like _bt_binsrch_array_skey would with a true
+ * array. The value 0 indicates that tupdatum/tupnull is within the range of
+ * the skip array. Other values indicate what _bt_compare_array_skey returned
+ * for the best available match to tupdatum/tupnull (in practice this means
+ * either the lowest item or the highest item in the range of the array).
+ *
+ * cur_elem_trig indicates if array advancement was triggered by this array's
+ * scan key. We use this to optimize-away comparisons that are known by our
+ * caller to be unnecessary from context, just like _bt_binsrch_array_skey.
+ */
+static void
+_bt_binsrch_skiparray_skey(FmgrInfo *orderproc,
+ bool cur_elem_trig, ScanDirection dir,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *set_elem_result)
+{
+ Assert(cur->sk_flags & SK_BT_SKIP);
+ Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(cur->sk_flags & SK_BT_REQFWD);
+ Assert(array->num_elems == -1);
+ Assert(!ScanDirectionIsNoMovement(dir));
+
+ if (tupnull) /* NULL tupdatum */
+ {
+ if (array->null_elem)
+ *set_elem_result = 0; /* NULL "=" NULL */
+ else if (cur->sk_flags & SK_BT_NULLS_FIRST)
+ *set_elem_result = -1; /* NULL "<" NOT_NULL */
+ else
+ *set_elem_result = 1; /* NULL ">" NOT_NULL */
+
+ return;
+ }
+
+ /*
+ * Array inequalities determine whether tupdatum is within the range of
+ * caller's skip array
+ */
+ *set_elem_result = 0;
+ if (ScanDirectionIsForward(dir))
+ {
+ /*
+ * Evaluate low_compare first (unless cur_elem_trig tells us that it
+ * cannot possibly fail to be satisfied), then evaluate high_compare
+ */
+ if (!cur_elem_trig && array->low_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->low_compare->sk_func,
+ array->low_compare->sk_collation,
+ tupdatum,
+ array->low_compare->sk_argument)))
+ *set_elem_result = -1;
+ else if (array->high_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->high_compare->sk_func,
+ array->high_compare->sk_collation,
+ tupdatum,
+ array->high_compare->sk_argument)))
+ *set_elem_result = 1;
+ }
+ else
+ {
+ /*
+ * Evaluate high_compare first (unless cur_elem_trig tells us that it
+ * cannot possibly fail to be satisfied), then evaluate low_compare
+ */
+ if (!cur_elem_trig && array->high_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->high_compare->sk_func,
+ array->high_compare->sk_collation,
+ tupdatum,
+ array->high_compare->sk_argument)))
+ *set_elem_result = 1;
+ else if (array->low_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->low_compare->sk_func,
+ array->low_compare->sk_collation,
+ tupdatum,
+ array->low_compare->sk_argument)))
+ *set_elem_result = -1;
+ }
+}
+
/*
* _bt_start_array_keys() -- Initialize array keys at start of a scan
*
@@ -1353,29 +1985,506 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
void
_bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- int i;
Assert(so->numArrayKeys);
Assert(so->qual_ok);
- for (i = 0; i < so->numArrayKeys; i++)
+ for (int i = 0; i < so->numArrayKeys; i++)
{
BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
ScanKey skey = &so->keyData[curArrayKey->scan_key];
- Assert(curArrayKey->num_elems > 0);
Assert(skey->sk_flags & SK_SEARCHARRAY);
- if (ScanDirectionIsBackward(dir))
- curArrayKey->cur_elem = curArrayKey->num_elems - 1;
- else
- curArrayKey->cur_elem = 0;
- skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
+ _bt_scankey_set_low_or_high(rel, skey, curArrayKey,
+ ScanDirectionIsForward(dir));
}
so->scanBehind = so->oppoDirCheck = false; /* reset */
}
+/*
+ * _bt_scankey_set_low_or_high() -- Set array scan key to lowest/highest element
+ *
+ * Caller also passes associated scan key, which will have its argument set to
+ * the lowest/highest array value in passing.
+ */
+static void
+_bt_scankey_set_low_or_high(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ bool low_not_high)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+
+ if (array->num_elems != -1)
+ {
+ /* set low or high element for conventional array */
+ int set_elem = 0;
+
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+
+ if (!low_not_high)
+ set_elem = array->num_elems - 1;
+
+ /*
+ * Just copy over array datum (only skip arrays require freeing and
+ * allocating memory for sk_argument)
+ */
+ array->cur_elem = set_elem;
+ skey->sk_argument = array->elem_values[set_elem];
+
+ return;
+ }
+
+ /* set low or high element for skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(array->num_elems == -1);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Clear possibly-irrelevant flags */
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEGPOSINF | SK_BT_NEXTPRIOR);
+
+ if (array->null_elem &&
+ (low_not_high == ((skey->sk_flags & SK_BT_NULLS_FIRST) != 0)))
+ {
+ /* Lowest (or highest) element is NULL, so set scan key to NULL */
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+ }
+ else if (low_not_high)
+ {
+ /* Lowest array element isn't NULL */
+ if (array->use_sksup && !array->low_compare)
+ skey->sk_argument = datumCopy(array->sksup.low_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_flags |= SK_BT_NEGPOSINF;
+ }
+ else
+ {
+ /* Highest array element isn't NULL */
+ if (array->use_sksup && !array->high_compare)
+ skey->sk_argument = datumCopy(array->sksup.high_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_flags |= SK_BT_NEGPOSINF;
+ }
+}
+
+/*
+ * _bt_scankey_set_element() -- Set skip array scan key's sk_argument
+ *
+ * Sets scan key to "IS NULL" when required, and handles memory management for
+ * pass-by-reference types.
+ */
+static void
+_bt_scankey_set_element(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ Datum tupdatum, bool tupnull)
+{
+ /* tupdatum within the range of low_compare/high_compare */
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(tupnull && !array->null_elem));
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEGPOSINF | SK_BT_NEXTPRIOR);
+
+ /*
+ * Treat tupdatum/tupnull as a matching array element.
+ *
+ * We just copy tupdatum into the array's scan key (there is no
+ * conventional array element for us to set, of course).
+ *
+ * Unlike standard arrays, skip arrays sometimes need to locate NULLs.
+ * Treat them as just another value from the domain of indexed values.
+ */
+ if (!tupnull)
+ skey->sk_argument = datumCopy(tupdatum, attr->attbyval, attr->attlen);
+ else
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+}
+
+/*
+ * _bt_scankey_unset_isnull() -- increment/decrement scan key from NULL
+ *
+ * Unsets scan key's "IS NULL" marking, and sets the non-NULL value from the
+ * array immediately before (or immediately after) NULL in the key space.
+ */
+static void
+_bt_scankey_unset_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(skey->sk_flags & SK_SEARCHNULL);
+ Assert(skey->sk_flags & SK_ISNULL);
+ Assert(!(skey->sk_flags & (SK_BT_NEGPOSINF | SK_BT_NEXTPRIOR)));
+ Assert(skey->sk_argument == 0);
+ Assert(array->use_sksup && array->null_elem &&
+ !array->low_compare && !array->high_compare);
+
+ /*
+ * sk_argument must be set to whatever non-NULL value comes immediately
+ * before or after NULL
+ */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL);
+ if (skey->sk_flags & SK_BT_NULLS_FIRST)
+ skey->sk_argument = datumCopy(array->sksup.low_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_argument = datumCopy(array->sksup.high_elem,
+ attr->attbyval, attr->attlen);
+}
+
+/*
+ * _bt_scankey_set_isnull() -- decrement/increment scan key to NULL
+ */
+static void
+_bt_scankey_set_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & (SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEGPOSINF | SK_BT_NEXTPRIOR)));
+ Assert(array->null_elem);
+ Assert(!array->low_compare && !array->high_compare);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Set sk_argument to NULL */
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+}
+
+/*
+ * _bt_scankey_decrement() -- decrement array scan key's sk_argument
+ *
+ * Return value indicates whether caller's array was successfully decremented.
+ * Cannot decrement an array whose current element is already the first one.
+ */
+static bool
+_bt_scankey_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ bool underflow = false;
+ Datum dec_sk_argument;
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & SK_BT_NEXTPRIOR));
+
+ /* Regular (non-skip) array? */
+ if (array->num_elems != -1)
+ {
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+ if (array->cur_elem > 0)
+ {
+ /*
+ * Just copy over array datum (only skip arrays require freeing
+ * and allocating memory for sk_argument)
+ */
+ array->cur_elem--;
+ skey->sk_argument = array->elem_values[array->cur_elem];
+
+ /* Successfully decremented array */
+ return true;
+ }
+
+ /* Cannot decrement to before first array element */
+ return false;
+ }
+
+ /* Nope, this is a skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+
+ /* The sentinel value -inf is never decrementable */
+ if (skey->sk_flags & SK_BT_NEGPOSINF)
+ return false;
+
+ /*
+ * When the current array element is NULL, and the lowest sorting value in
+ * the index is also NULL, we cannot decrement before first array element
+ */
+ if ((skey->sk_flags & SK_ISNULL) && (skey->sk_flags & SK_BT_NULLS_FIRST))
+ return false;
+
+ /*
+ * Opclasses without skip support "decrement" the scan key's current
+ * element by setting the NEXTPRIOR flag. The true prior value can only
+ * be determined when the scan reads lower sorting tuples.
+ *
+ * When the current array element is NULL, and the highest sorting value
+ * in the index is also NULL, _bt_first can find the highest non-NULL value.
+ */
+ if (!array->use_sksup)
+ {
+ /*
+ * Determine as best we can (given the lack of skip support) whether
+ * the prior element will turn out to be out of bounds for the skip
+ * array.
+ *
+ * Skip arrays (that lack skip support) can only do this when their
+ * low_compare is for an >= inequality; if the current array element
+ * is == the inequality's sk_argument, then the true prior value
+ * cannot possibly satisfy low_compare. We can give up right away.
+ */
+ if (array->low_compare &&
+ array->low_compare->sk_strategy == BTGreaterEqualStrategyNumber &&
+ _bt_compare_array_skey(&array->order_low,
+ array->low_compare->sk_argument, false,
+ skey->sk_argument, false,
+ skey) == 0)
+ return false;
+
+ /* else the scan must figure out the true prior value */
+ skey->sk_flags |= SK_BT_NEXTPRIOR;
+ return true;
+ }
+
+ /*
+ * Opclasses with skip support decrement the scan key's current element
+ * using a callback
+ */
+ if (skey->sk_flags & SK_ISNULL)
+ {
+ Assert(!(skey->sk_flags & SK_BT_NULLS_FIRST));
+
+ /*
+ * Existing sk_argument/array element is NULL (for an IS NULL qual).
+ *
+ * Decrement current array element to the high_elem value provided by
+ * opclass skip support routine.
+ */
+ _bt_scankey_unset_isnull(rel, skey, array);
+ return true;
+ }
+
+ /*
+ * Ask opclass support routine to provide decremented copy of existing
+ * non-NULL sk_argument
+ */
+ dec_sk_argument = array->sksup.decrement(rel, skey->sk_argument, &underflow);
+
+ if (underflow)
+ {
+ if (array->null_elem && (skey->sk_flags & SK_BT_NULLS_FIRST))
+ {
+ /*
+ * Existing sk_argument was already equal to non-NULL low_elem
+ * provided by opclass skip support routine, but skip array's true
+ * lowest element is actually NULL.
+ *
+ * Decrement sk_argument to NULL.
+ */
+ _bt_scankey_set_isnull(rel, skey, array);
+ return true;
+ }
+
+ /* Cannot decrement before first array element */
+ return false;
+ }
+
+ /*
+ * Successfully decremented sk_argument to a non-NULL value. Make sure
+ * that the decremented value is still within the range of the skip array.
+ */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (array->low_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->low_compare->sk_func,
+ array->low_compare->sk_collation,
+ dec_sk_argument,
+ array->low_compare->sk_argument)))
+ {
+ /* Keep existing sk_argument after all */
+ if (!attr->attbyval)
+ pfree(DatumGetPointer(dec_sk_argument));
+
+ /* Cannot decrement before first array element */
+ return false;
+ }
+
+ /* Accept non-NULL datum value from opclass decrement callback */
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+ skey->sk_argument = dec_sk_argument;
+
+ return true;
+}
+
+/*
+ * _bt_scankey_increment() -- increment array scan key's sk_argument
+ *
+ * Return value indicates whether caller's array was successfully incremented.
+ * Cannot increment an array whose current element is already the final one.
+ */
+static bool
+_bt_scankey_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ bool overflow = false;
+ Datum inc_sk_argument;
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & SK_BT_NEXTPRIOR));
+
+ /* Regular (non-skip) array? */
+ if (array->num_elems != -1)
+ {
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+ if (array->cur_elem < array->num_elems - 1)
+ {
+ /*
+ * Just copy over array datum (only skip arrays require freeing
+ * and allocating memory for sk_argument)
+ */
+ array->cur_elem++;
+ skey->sk_argument = array->elem_values[array->cur_elem];
+
+ /* Successfully incremented array */
+ return true;
+ }
+
+ /* Cannot increment past final array element */
+ return false;
+ }
+
+ /* Nope, this is a skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+
+ /* The sentinel value +inf is never incrementable */
+ if (skey->sk_flags & SK_BT_NEGPOSINF)
+ return false;
+
+ /*
+ * When the current array element is NULL, and the highest sorting value
+ * in the index is also NULL, we cannot increment past the final element
+ */
+ if ((skey->sk_flags & SK_ISNULL) && !(skey->sk_flags & SK_BT_NULLS_FIRST))
+ return false;
+
+ /*
+ * Opclasses without skip support "increment" the scan key's current
+ * element by setting the NEXTPRIOR flag. The true next value can only be
+ * determined when the scan reads higher sorting tuples.
+ *
+ * When the current array element is NULL, and the lowest sorting value in
+ * the index is also NULL, _bt_first can find the lowest non-NULL value.
+ */
+ if (!array->use_sksup)
+ {
+ /*
+ * Determine as best we can (given the lack of skip support) whether
+ * the next element will turn out to be out of bounds for the skip
+ * array.
+ *
+ * Skip arrays (that lack skip support) can only do this when their
+ * high_compare is for an <= inequality; if the current array element
+ * is == the inequality's sk_argument, then the true next value cannot
+ * possibly satisfy high_compare. We can give up right away.
+ */
+ if (array->high_compare &&
+ array->high_compare->sk_strategy == BTLessEqualStrategyNumber &&
+ _bt_compare_array_skey(&array->order_high,
+ array->high_compare->sk_argument, false,
+ skey->sk_argument, false,
+ skey) == 0)
+ return false;
+
+ /* else the scan must figure out the true next value */
+ skey->sk_flags |= SK_BT_NEXTPRIOR;
+ return true;
+ }
+
+ /*
+ * Opclasses with skip support increment the scan key's current element
+ * using a callback
+ */
+ if (skey->sk_flags & SK_ISNULL)
+ {
+ Assert(skey->sk_flags & SK_BT_NULLS_FIRST);
+
+ /*
+ * Existing sk_argument/array element is NULL (for an IS NULL qual).
+ *
+ * Increment current array element to the low_elem value provided by
+ * opclass skip support routine.
+ */
+ _bt_scankey_unset_isnull(rel, skey, array);
+ return true;
+ }
+
+ /*
+ * Ask opclass support routine to provide incremented copy of existing
+ * non-NULL sk_argument
+ */
+ inc_sk_argument = array->sksup.increment(rel, skey->sk_argument, &overflow);
+
+ if (overflow)
+ {
+ if (array->null_elem && !(skey->sk_flags & SK_BT_NULLS_FIRST))
+ {
+ /*
+ * Existing sk_argument was already equal to non-NULL high_elem
+ * provided by opclass skip support routine, but skip array's true
+ * highest element is actually NULL.
+ *
+ * Increment sk_argument to NULL.
+ */
+ _bt_scankey_set_isnull(rel, skey, array);
+ return true;
+ }
+
+ /* Cannot increment past final array element */
+ return false;
+ }
+
+ /*
+ * Successfully incremented sk_argument to a non-NULL value. Make sure
+ * that the incremented value is still within the range of the skip array.
+ */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (array->high_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->high_compare->sk_func,
+ array->high_compare->sk_collation,
+ inc_sk_argument,
+ array->high_compare->sk_argument)))
+ {
+ /* Keep existing sk_argument after all */
+ if (!attr->attbyval)
+ pfree(DatumGetPointer(inc_sk_argument));
+
+ /* Cannot increment past final array element */
+ return false;
+ }
+
+ /* Accept non-NULL datum value from opclass increment callback */
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+ skey->sk_argument = inc_sk_argument;
+
+ return true;
+}
+
/*
* _bt_advance_array_keys_increment() -- Advance to next set of array elements
*
@@ -1391,6 +2500,7 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
static bool
_bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
/*
@@ -1400,29 +2510,30 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
*/
for (int i = so->numArrayKeys - 1; i >= 0; i--)
{
- BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
- ScanKey skey = &so->keyData[curArrayKey->scan_key];
- int cur_elem = curArrayKey->cur_elem;
- int num_elems = curArrayKey->num_elems;
- bool rolled = false;
+ BTArrayKeyInfo *array = &so->arrayKeys[i];
+ ScanKey skey = &so->keyData[array->scan_key];
- if (ScanDirectionIsForward(dir) && ++cur_elem >= num_elems)
+ if (ScanDirectionIsForward(dir))
{
- cur_elem = 0;
- rolled = true;
+ if (_bt_scankey_increment(rel, skey, array))
+ return true;
}
- else if (ScanDirectionIsBackward(dir) && --cur_elem < 0)
+ else
{
- cur_elem = num_elems - 1;
- rolled = true;
+ if (_bt_scankey_decrement(rel, skey, array))
+ return true;
}
- curArrayKey->cur_elem = cur_elem;
- skey->sk_argument = curArrayKey->elem_values[cur_elem];
- if (!rolled)
- return true;
+ /*
+ * Handle array roll over.
+ *
+ * Start over at the array's lowest sorting value (or its highest
+ * value, for backward scans)...
+ */
+ _bt_scankey_set_low_or_high(rel, skey, array,
+ ScanDirectionIsForward(dir));
- /* Need to advance next array key, if any */
+ /* ...then advance next most significant array, if any */
}
/*
@@ -1477,6 +2588,7 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
static void
_bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
int arrayidx = 0;
@@ -1484,7 +2596,6 @@ _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
{
ScanKey cur = so->keyData + ikey;
BTArrayKeyInfo *array = NULL;
- int first_elem_dir;
if (!(cur->sk_flags & SK_SEARCHARRAY) ||
cur->sk_strategy != BTEqualStrategyNumber)
@@ -1496,16 +2607,10 @@ _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
if ((cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
continue;
- if (ScanDirectionIsForward(dir))
- first_elem_dir = 0;
- else
- first_elem_dir = array->num_elems - 1;
+ Assert(array->num_elems != -1); /* No skipping of non-required arrays */
- if (array->cur_elem != first_elem_dir)
- {
- array->cur_elem = first_elem_dir;
- cur->sk_argument = array->elem_values[first_elem_dir];
- }
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
}
}
@@ -1569,6 +2674,8 @@ _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
for (int ikey = sktrig; ikey < so->numberOfKeys; ikey++)
{
ScanKey cur = so->keyData + ikey;
+ Datum sk_argument = cur->sk_argument;
+ bool sk_isnull = (cur->sk_flags & SK_ISNULL) != 0;
Datum tupdatum;
bool tupnull;
int32 result;
@@ -1630,9 +2737,67 @@ _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
tupdatum = index_getattr(tuple, cur->sk_attno, tupdesc, &tupnull);
- result = _bt_compare_array_skey(&so->orderProcs[ikey],
- tupdatum, tupnull,
- cur->sk_argument, cur);
+ if (!(cur->sk_flags & SK_BT_NEGPOSINF))
+ {
+ /* Just use the array's current array element */
+ result = _bt_compare_array_skey(&so->orderProcs[ikey],
+ tupdatum, tupnull,
+ sk_argument, sk_isnull, cur);
+
+ /*
+ * When scan key is marked NEXTPRIOR, the current array element is
+ * "sk_argument + infinitesimal" (or the current array element is
+ * "sk_argument - infinitesimal", during backwards scans)
+ */
+ if (result == 0 && (cur->sk_flags & SK_BT_NEXTPRIOR))
+ {
+ /*
+ * tupdatum is actually still < "sk_argument + infinitesimal"
+ * (or it's actually still > "sk_argument - infinitesimal")
+ */
+ return true;
+ }
+ }
+ else
+ {
+ /*
+ * The scankey lacks a conventional sk_argument/element value,
+ * since it's marked as containing the sentinel value -inf/+inf.
+ *
+ * Note: -inf could mean "absolute" -inf, or it could represent
+ * the lowest possible value that still satisfies the array's
+ * low_compare. +inf and high_compare work similarly.
+ */
+ BTArrayKeyInfo *array = NULL;
+
+ for (int arrayidx = 0; arrayidx < so->numArrayKeys; arrayidx++)
+ {
+ array = &so->arrayKeys[arrayidx];
+ if (array->scan_key == ikey)
+ break;
+ }
+
+ /*
+ * Compare tupdatum against -inf using array's low_compare, if any
+ * (or compare it against +inf using array's high_compare).
+ *
+ * Optimization: avoid uselessly evaluating array's high_compare
+ * (or uselessly evaluating array's low_compare) by passing
+ * cur_elem_trig=true, along with an inverted scan direction.
+ */
+ _bt_binsrch_skiparray_skey(&so->orderProcs[ikey], true, -dir,
+ tupdatum, tupnull, array, cur,
+ &result);
+
+ if (result == 0)
+ {
+ /*
+ * tupdatum is > -inf sk_argument (or < +inf sk_argument).
+ * It's time for caller to advance the scan's array keys.
+ */
+ return false;
+ }
+ }
/*
* Does this comparison indicate that caller must _not_ advance the
@@ -1964,18 +3129,9 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
if (beyond_end_advance)
{
- int final_elem_dir;
-
- if (ScanDirectionIsBackward(dir) || !array)
- final_elem_dir = 0;
- else
- final_elem_dir = array->num_elems - 1;
-
- if (array && array->cur_elem != final_elem_dir)
- {
- array->cur_elem = final_elem_dir;
- cur->sk_argument = array->elem_values[final_elem_dir];
- }
+ if (array)
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsBackward(dir));
continue;
}
@@ -2000,18 +3156,9 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
if (!all_required_satisfied || cur->sk_attno > tupnatts)
{
- int first_elem_dir;
-
- if (ScanDirectionIsForward(dir) || !array)
- first_elem_dir = 0;
- else
- first_elem_dir = array->num_elems - 1;
-
- if (array && array->cur_elem != first_elem_dir)
- {
- array->cur_elem = first_elem_dir;
- cur->sk_argument = array->elem_values[first_elem_dir];
- }
+ if (array)
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
continue;
}
@@ -2029,15 +3176,27 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
/*
* Binary search for closest match that's available from the array
*/
- set_elem = _bt_binsrch_array_skey(&so->orderProcs[ikey],
- cur_elem_trig, dir,
- tupdatum, tupnull, array, cur,
- &result);
+ if (array->num_elems != -1)
+ set_elem = _bt_binsrch_array_skey(&so->orderProcs[ikey],
+ cur_elem_trig, dir,
+ tupdatum, tupnull, array, cur,
+ &result);
- Assert(set_elem >= 0 && set_elem < array->num_elems);
+ /*
+ * Skip array. "Binary search" by checking if tupdatum/tupnull
+ * are within the low_value/high_value range of the skip array.
+ */
+ else
+ _bt_binsrch_skiparray_skey(&so->orderProcs[ikey],
+ cur_elem_trig, dir,
+ tupdatum, tupnull, array, cur,
+ &result);
}
else
{
+ Datum sk_argument = cur->sk_argument;
+ bool sk_isnull = (cur->sk_flags & SK_ISNULL) != 0;
+
Assert(sktrig_required && required);
/*
@@ -2051,7 +3210,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
result = _bt_compare_array_skey(&so->orderProcs[ikey],
tupdatum, tupnull,
- cur->sk_argument, cur);
+ sk_argument, sk_isnull, cur);
}
/*
@@ -2110,11 +3269,76 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
}
}
- /* Advance array keys, even when set_elem isn't an exact match */
- if (array && array->cur_elem != set_elem)
+ /* Advance array keys, even when we don't have an exact match */
+
+ if (!array)
+ continue; /* no element to set in non-array */
+
+ /* Conventional arrays have a valid set_elem for us to advance to */
+ if (array->num_elems != -1)
{
- array->cur_elem = set_elem;
- cur->sk_argument = array->elem_values[set_elem];
+ if (array->cur_elem != set_elem)
+ {
+ array->cur_elem = set_elem;
+ cur->sk_argument = array->elem_values[set_elem];
+ }
+
+ continue;
+ }
+
+ /*
+ * Skip arrays generate array elements procedurally and on demand.
+ * They "contain" elements for every possible datum from a given range
+ * of values. This is often the range -inf through to +inf.
+ */
+ Assert(cur->sk_flags & SK_BT_SKIP);
+ Assert(array->num_elems == -1);
+ Assert(required);
+
+ /*
+ * When a binary search of a conventional array locates a set_elem
+ * that is merely the best available match for tupdatum (not an exact
+ * match), set_elem isn't necessarily set to the absolute lowest or
+ * highest array element (though we must set subsequent lower-order
+ * !all_required_satisfied arrays that way, as the process cascades).
+ *
+ * However, when a "binary search" of a skip array finds that tupdatum
+ * isn't within the range of the skip array, we always advance the
+ * array to either the highest or the lowest possible element value
+ * (it's often set to either the +inf or the -inf element/value).
+ * There can be no "gaps between array elements", so either we find an
+ * exact match or we follow the same steps followed for later arrays
+ * that array advancement will cascade to.
+ */
+ if (beyond_end_advance)
+ {
+ /*
+ * We need to set the array element to the final element in the
+ * current scan direction for "beyond end of array element" array
+ * advancement
+ */
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsBackward(dir));
+ }
+ else if (!all_required_satisfied)
+ {
+ /*
+ * The closest matching element is the lowest element; even that
+ * still puts us ahead of caller's tuple in the key space
+ */
+ Assert(sktrig < ikey); /* Caller must get this right */
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
+ }
+ else
+ {
+ /*
+ * Search found tupdatum within the range of the skip array.
+ *
+ * Set scan key's sk_argument to tupdatum. If tupdatum is null,
+ * we'll set IS NULL flags in scan key's sk_flags instead.
+ */
+ _bt_scankey_set_element(rel, cur, array, tupdatum, tupnull);
}
}
@@ -2465,6 +3689,8 @@ end_toplevel_scan:
* within each attribute may be done as a byproduct of the processing here.
* That process must leave array scan keys (within an attribute) in the same
* order as corresponding entries from the scan's BTArrayKeyInfo array info.
+ * We might also cons up skip array scan keys that weren't present in the
+ * original input keys; these are also output in standard attribute order.
*
* The output keys are marked with flags SK_BT_REQFWD and/or SK_BT_REQBKWD
* if they must be satisfied in order to continue the scan forward or backward
@@ -2588,8 +3814,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
inputsk = scan->keyData;
/*
- * Now that we have an estimate of the number of output scan keys,
- * allocate space for them
+ * Now that we have an estimate of the number of output scan keys
+ * (including any skip array scan keys), allocate space for them
*/
so->keyData = palloc(sizeof(ScanKeyData) * numberOfKeys);
@@ -2725,7 +3951,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
return;
}
/* else discard the redundant non-equality key */
- Assert(!array || array->num_elems > 0);
+ Assert(!array || array->num_elems > 0 ||
+ array->num_elems == -1);
xform[j].skey = NULL;
xform[j].ikey = -1;
}
@@ -2888,7 +4115,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
/* Have all we need to determine redundancy */
if (test_result)
{
- Assert(!array || array->num_elems > 0);
+ Assert(!array || array->num_elems > 0 ||
+ array->num_elems == -1);
/*
* New key is more restrictive, and so replaces old key...
@@ -3030,10 +4258,11 @@ _bt_verify_keys_with_arraykeys(IndexScanDesc scan)
if (array->scan_key != ikey)
return false;
- if (array->num_elems <= 0)
+ if (array->num_elems == 0 || array->num_elems < -1)
return false;
- if (cur->sk_argument != array->elem_values[array->cur_elem])
+ if (array->num_elems != -1 &&
+ cur->sk_argument != array->elem_values[array->cur_elem])
return false;
if (last_sk_attno > cur->sk_attno)
return false;
@@ -3108,6 +4337,22 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool leftnull,
rightnull;
+ /* Handle skip array comparison with IS NOT NULL scan key */
+ if ((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP)
+ {
+ /* Shouldn't generate skip array in presence of IS NULL key */
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_SEARCHNULL));
+ Assert((leftarg->sk_flags | rightarg->sk_flags) & SK_SEARCHNOTNULL);
+
+ /* Skip array will have no NULL element/IS NULL scan key */
+ Assert(array->num_elems == -1);
+ array->null_elem = false;
+
+ /* IS NOT NULL key (could be leftarg or rightarg) now redundant */
+ *result = true;
+ return true;
+ }
+
if (leftarg->sk_flags & SK_ISNULL)
{
Assert(leftarg->sk_flags & (SK_SEARCHNULL | SK_SEARCHNOTNULL));
@@ -3181,6 +4426,7 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
{
/* Can't make the comparison */
*result = false; /* suppress compiler warnings */
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP));
return false;
}
@@ -3744,6 +4990,20 @@ _bt_check_compare(IndexScanDesc scan, ScanDirection dir,
continue;
}
+ /*
+ * A skip array scan key might be negative/positive infinity. It might
+ * also be a next key/prior key sentinel. We don't deal with either here.
+ */
+ if (key->sk_flags & (SK_BT_NEGPOSINF | SK_BT_NEXTPRIOR))
+ {
+ Assert(key->sk_flags & SK_SEARCHARRAY);
+ Assert(key->sk_flags & SK_BT_SKIP);
+ Assert(requiredSameDir);
+
+ *continuescan = false;
+ return false;
+ }
+
/* row-comparison keys need special processing */
if (key->sk_flags & SK_ROW_HEADER)
{
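The roll over handling in _bt_advance_array_keys_increment above works much like an odometer. Here is a minimal standalone sketch of that idea, assuming only conventional (non-skip) arrays. DemoArray, demo_step and demo_advance are hypothetical names made up for illustration; none of this is code from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a conventional (non-skip) array key */
typedef struct DemoArray
{
	int			num_elems;	/* number of array elements */
	int			cur_elem;	/* offset of the current element */
} DemoArray;

/* Step one array in the scan direction; false means "cannot step" */
static bool
demo_step(DemoArray *array, bool forward)
{
	if (forward)
	{
		if (array->cur_elem < array->num_elems - 1)
		{
			array->cur_elem++;
			return true;
		}
		return false;
	}

	if (array->cur_elem > 0)
	{
		array->cur_elem--;
		return true;
	}
	return false;
}

/*
 * Advance the least significant array first.  An array that cannot be
 * stepped is reset to its first element in the scan direction, and then the
 * next most significant array is tried.  Returns false once every array has
 * rolled over.
 */
static bool
demo_advance(DemoArray *arrays, int narrays, bool forward)
{
	assert(narrays > 0);

	for (int i = narrays - 1; i >= 0; i--)
	{
		if (demo_step(&arrays[i], forward))
			return true;

		/* roll over: start again at lowest (or highest) element */
		arrays[i].cur_elem = forward ? 0 : arrays[i].num_elems - 1;
	}

	return false;
}
```

In the patch itself, the per-array step is delegated to _bt_scankey_increment/_bt_scankey_decrement, so the same loop can also drive skip arrays, whose elements are generated on demand rather than stored.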
diff --git a/src/backend/access/nbtree/nbtvalidate.c b/src/backend/access/nbtree/nbtvalidate.c
index e9d4cd60d..96d0d9185 100644
--- a/src/backend/access/nbtree/nbtvalidate.c
+++ b/src/backend/access/nbtree/nbtvalidate.c
@@ -114,6 +114,10 @@ btvalidate(Oid opclassoid)
case BTOPTIONS_PROC:
ok = check_amoptsproc_signature(procform->amproc);
break;
+ case BTSKIPSUPPORT_PROC:
+ ok = check_amproc_signature(procform->amproc, VOIDOID, true,
+ 1, 1, INTERNALOID);
+ break;
default:
ereport(INFO,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
diff --git a/src/backend/commands/opclasscmds.c b/src/backend/commands/opclasscmds.c
index b8b5c147c..a86dbf71b 100644
--- a/src/backend/commands/opclasscmds.c
+++ b/src/backend/commands/opclasscmds.c
@@ -1330,6 +1330,31 @@ assignProcTypes(OpFamilyMember *member, Oid amoid, Oid typeoid,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
errmsg("btree equal image functions must not be cross-type")));
}
+ else if (member->number == BTSKIPSUPPORT_PROC)
+ {
+ if (procform->pronargs != 1 ||
+ procform->proargtypes.values[0] != INTERNALOID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must accept type \"internal\"")));
+ if (procform->prorettype != VOIDOID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must return void")));
+
+ /*
+ * pg_amproc functions are indexed by (lefttype, righttype), but a
+ * skip support function doesn't make sense in cross-type
+ * scenarios: lookups always use the same opclass opcintype OID
+ * for both lefttype and righttype, so a cross-type routine could
+ * never be used. Reject cross-type ALTER OPERATOR FAMILY ... ADD
+ * FUNCTION 6 statements here.
+ */
+ if (member->lefttype != member->righttype)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must not be cross-type")));
+ }
}
else if (amoid == HASH_AM_OID)
{
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index edb09d4e3..e945686c8 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -96,6 +96,7 @@ OBJS = \
rowtypes.o \
ruleutils.o \
selfuncs.o \
+ skipsupport.o \
tid.o \
timestamp.o \
trigfuncs.o \
diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c
index 9c854e0e5..79658f068 100644
--- a/src/backend/utils/adt/date.c
+++ b/src/backend/utils/adt/date.c
@@ -34,6 +34,7 @@
#include "utils/date.h"
#include "utils/datetime.h"
#include "utils/numeric.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
/*
@@ -455,6 +456,49 @@ date_sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+date_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ DateADT dexisting = DatumGetDateADT(existing);
+
+ if (dexisting == DATEVAL_NOBEGIN)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return DateADTGetDatum(dexisting - 1);
+}
+
+static Datum
+date_increment(Relation rel, Datum existing, bool *overflow)
+{
+ DateADT dexisting = DatumGetDateADT(existing);
+
+ if (dexisting == DATEVAL_NOEND)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return DateADTGetDatum(dexisting + 1);
+}
+
+Datum
+date_skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = date_decrement;
+ sksup->increment = date_increment;
+ sksup->low_elem = DateADTGetDatum(DATEVAL_NOBEGIN);
+ sksup->high_elem = DateADTGetDatum(DATEVAL_NOEND);
+
+ PG_RETURN_VOID();
+}
+
Datum
date_finite(PG_FUNCTION_ARGS)
{
diff --git a/src/backend/utils/adt/meson.build b/src/backend/utils/adt/meson.build
index 8c6fc80c3..91682edd5 100644
--- a/src/backend/utils/adt/meson.build
+++ b/src/backend/utils/adt/meson.build
@@ -83,6 +83,7 @@ backend_sources += files(
'rowtypes.c',
'ruleutils.c',
'selfuncs.c',
+ 'skipsupport.c',
'tid.c',
'timestamp.c',
'trigfuncs.c',
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index bf42393be..0e6e9ebb9 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6806,6 +6806,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
List *indexBoundQuals;
int indexcol;
bool eqQualHere;
+ bool found_skip;
bool found_saop;
bool found_is_null_op;
double num_sa_scans;
@@ -6831,6 +6832,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
indexBoundQuals = NIL;
indexcol = 0;
eqQualHere = false;
+ found_skip = false;
found_saop = false;
found_is_null_op = false;
num_sa_scans = 1;
@@ -6839,15 +6841,38 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
IndexClause *iclause = lfirst_node(IndexClause, lc);
ListCell *lc2;
+ /*
+ * XXX For now we just cost skip scans via generic rules: make a
+ * uniform assumption that there will be 10 primitive index scans per
+ * skipped attribute, relying on the "1/3 of all index pages" cap that
+ * this costing has used since Postgres 17. Also assume that skipping
+ * won't take place for an index that has fewer than 100 pages.
+ *
+ * The current approach to costing leaves much to be desired, but is
+ * at least better than nothing at all (keeping the code as it is on
+ * HEAD just makes testing and review inconvenient).
+ */
if (indexcol != iclause->indexcol)
{
/* Beginning of a new column's quals */
if (!eqQualHere)
- break; /* done if no '=' qual for indexcol */
+ {
+ found_skip = true; /* skip when no '=' qual for indexcol */
+ if (index->pages < 100)
+ break;
+ num_sa_scans += 10;
+ }
eqQualHere = false;
indexcol++;
if (indexcol != iclause->indexcol)
- break; /* no quals at all for indexcol */
+ {
+ /* no quals at all for indexcol */
+ found_skip = true;
+ if (index->pages < 100)
+ break;
+ num_sa_scans += 10 * (iclause->indexcol - indexcol);
+ continue;
+ }
}
/* Examine each indexqual associated with this index clause */
@@ -6920,6 +6945,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
if (index->unique &&
indexcol == index->nkeycolumns - 1 &&
eqQualHere &&
+ !found_skip &&
!found_saop &&
!found_is_null_op)
numIndexTuples = 1.0;
diff --git a/src/backend/utils/adt/skipsupport.c b/src/backend/utils/adt/skipsupport.c
new file mode 100644
index 000000000..d91471e26
--- /dev/null
+++ b/src/backend/utils/adt/skipsupport.c
@@ -0,0 +1,60 @@
+/*-------------------------------------------------------------------------
+ *
+ * skipsupport.c
+ * Support routines for B-Tree skip scan.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/adt/skipsupport.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "utils/lsyscache.h"
+#include "utils/skipsupport.h"
+
+/*
+ * Fill in SkipSupport given an operator class (opfamily + opcintype).
+ *
+ * On success, returns true, and initializes all SkipSupport fields for
+ * caller. Otherwise returns false, indicating that operator class has no
+ * skip support function.
+ */
+bool
+PrepareSkipSupportFromOpclass(Oid opfamily, Oid opcintype, bool reverse,
+ SkipSupport sksup)
+{
+ Oid skipSupportFunction;
+
+ /* Look for a skip support function */
+ skipSupportFunction = get_opfamily_proc(opfamily, opcintype, opcintype,
+ BTSKIPSUPPORT_PROC);
+ if (!OidIsValid(skipSupportFunction))
+ return false;
+
+ OidFunctionCall1(skipSupportFunction, PointerGetDatum(sksup));
+
+ if (reverse)
+ {
+ /*
+ * DESC/reverse case: swap low_elem with high_elem, and swap decrement
+ * with increment
+ */
+ Datum low_elem = sksup->low_elem;
+ SkipSupportIncDec decrement = sksup->decrement;
+
+ sksup->low_elem = sksup->high_elem;
+ sksup->decrement = sksup->increment;
+
+ sksup->high_elem = low_elem;
+ sksup->increment = decrement;
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/adt/uuid.c b/src/backend/utils/adt/uuid.c
index 45eb1b2fe..e2d98a62f 100644
--- a/src/backend/utils/adt/uuid.c
+++ b/src/backend/utils/adt/uuid.c
@@ -13,12 +13,15 @@
#include "postgres.h"
+#include <limits.h>
+
#include "common/hashfn.h"
#include "lib/hyperloglog.h"
#include "libpq/pqformat.h"
#include "port/pg_bswap.h"
#include "utils/fmgrprotos.h"
#include "utils/guc.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
#include "utils/timestamp.h"
#include "utils/uuid.h"
@@ -390,6 +393,70 @@ uuid_abbrev_convert(Datum original, SortSupport ssup)
return res;
}
+static Datum
+uuid_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ pg_uuid_t *uuid;
+
+ uuid = (pg_uuid_t *) palloc(UUID_LEN);
+ memcpy(uuid, DatumGetUUIDP(existing), UUID_LEN);
+ *underflow = false;
+ for (int i = UUID_LEN - 1; i >= 0; i--)
+ {
+ if (uuid->data[i] > 0)
+ {
+ uuid->data[i]--;
+ return UUIDPGetDatum(uuid);
+ }
+ uuid->data[i] = UCHAR_MAX;
+ }
+
+ *underflow = true;
+
+ return 0;
+}
+
+static Datum
+uuid_increment(Relation rel, Datum existing, bool *overflow)
+{
+ pg_uuid_t *uuid;
+
+ uuid = (pg_uuid_t *) palloc(UUID_LEN);
+ memcpy(uuid, DatumGetUUIDP(existing), UUID_LEN);
+ *overflow = false;
+ for (int i = UUID_LEN - 1; i >= 0; i--)
+ {
+ if (uuid->data[i] < UCHAR_MAX)
+ {
+ uuid->data[i]++;
+ return UUIDPGetDatum(uuid);
+ }
+ uuid->data[i] = 0;
+ }
+
+ *overflow = true;
+
+ return 0;
+}
+
+Datum
+uuid_skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+ pg_uuid_t *uuid_min = palloc(UUID_LEN);
+ pg_uuid_t *uuid_max = palloc(UUID_LEN);
+
+ memset(uuid_min->data, 0x00, UUID_LEN);
+ memset(uuid_max->data, 0xFF, UUID_LEN);
+
+ sksup->decrement = uuid_decrement;
+ sksup->increment = uuid_increment;
+ sksup->low_elem = UUIDPGetDatum(uuid_min);
+ sksup->high_elem = UUIDPGetDatum(uuid_max);
+
+ PG_RETURN_VOID();
+}
+
/* hash index support */
Datum
uuid_hash(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index c0a52cdcc..131b23a8e 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -28,6 +28,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/nbtree.h"
#include "access/slru.h"
#include "access/toast_compression.h"
#include "access/twophase.h"
@@ -1753,6 +1754,17 @@ struct config_bool ConfigureNamesBool[] =
},
#endif
+ /* XXX Remove before commit */
+ {
+ {"skipscan_skipsupport_enabled", PGC_SUSET, DEVELOPER_OPTIONS,
+ NULL, NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &skipscan_skipsupport_enabled,
+ true,
+ NULL, NULL, NULL
+ },
+
{
{"integer_datetimes", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows whether datetimes are integer based."),
@@ -3587,6 +3599,17 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ /* XXX Remove before commit */
+ {
+ {"skipscan_prefix_cols", PGC_SUSET, DEVELOPER_OPTIONS,
+ NULL, NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &skipscan_prefix_cols,
+ INDEX_MAX_KEYS, 0, INDEX_MAX_KEYS,
+ NULL, NULL, NULL
+ },
+
{
/* Can't be set in postgresql.conf */
{"server_version_num", PGC_INTERNAL, PRESET_OPTIONS,
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 2b3997988..9662fb2ba 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -583,6 +583,19 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
</para>
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><function>skipsupport</function></term>
+ <listitem>
+ <para>
+ Optionally, a btree operator family may provide a <firstterm>skip
+ support</firstterm> function, registered under support function
+ number 6. These functions allow the B-tree code to more efficiently
+ navigate the index structure via an index <quote>skip scan</quote>. The
+ APIs involved in this are defined in
+ <filename>src/include/utils/skipsupport.h</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</sect2>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 6d731e070..651f9323e 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -457,23 +457,26 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
<para>
A multicolumn B-tree index can be used with query conditions that
involve any subset of the index's columns, but the index is most
- efficient when there are constraints on the leading (leftmost) columns.
- The exact rule is that equality constraints on leading columns, plus
- any inequality constraints on the first column that does not have an
- equality constraint, will be used to limit the portion of the index
- that is scanned. Constraints on columns to the right of these columns
- are checked in the index, so they save visits to the table proper, but
- they do not reduce the portion of the index that has to be scanned.
+ efficient when there are equality constraints on the leading (leftmost) columns.
+ B-Tree index scans can use the index skip scan strategy to generate
+ equality constraints on prefix columns that were wholly omitted from the
+ query predicate, as well as prefix columns whose values were constrained by
+ inequality conditions.
For example, given an index on <literal>(a, b, c)</literal> and a
query condition <literal>WHERE a = 5 AND b >= 42 AND c < 77</literal>,
the index would have to be scanned from the first entry with
<literal>a</literal> = 5 and <literal>b</literal> = 42 up through the last entry with
- <literal>a</literal> = 5. Index entries with <literal>c</literal> >= 77 would be
- skipped, but they'd still have to be scanned through.
+ <literal>a</literal> = 5. Intervening groups of index entries with
+ <literal>c</literal> >= 77 would not need to be returned by the scan,
+ and can be skipped over entirely by applying the skip scan strategy.
This index could in principle be used for queries that have constraints
on <literal>b</literal> and/or <literal>c</literal> with no constraint on <literal>a</literal>
- — but the entire index would have to be scanned, so in most cases
- the planner would prefer a sequential table scan over using the index.
+ — but that approach is generally only taken when there are so few
+ distinct <literal>a</literal> values that the planner expects the skip scan
+ strategy to allow the scan to skip over most individual index leaf pages.
+ If there are many distinct <literal>a</literal> values, then the entire
+ index will have to be scanned, so in most cases the planner will prefer a
+ sequential table scan over using the index.
</para>
<para>
@@ -508,10 +511,7 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
</para>
<para>
- Multicolumn indexes should be used sparingly. In most situations,
- an index on a single column is sufficient and saves space and time.
- Indexes with more than three columns are unlikely to be helpful
- unless the usage of the table is extremely stylized. See also
+ Multicolumn indexes should be used judiciously. See
<xref linkend="indexes-bitmap-scans"/> and
<xref linkend="indexes-index-only-scans"/> for some discussion of the
merits of different index configurations.
@@ -669,9 +669,13 @@ CREATE INDEX test3_desc_index ON test3 (id DESC NULLS LAST);
multicolumn index on <literal>(x, y)</literal>. This index would typically be
more efficient than index combination for queries involving both
columns, but as discussed in <xref linkend="indexes-multicolumn"/>, it
- would be almost useless for queries involving only <literal>y</literal>, so it
- should not be the only index. A combination of the multicolumn index
- and a separate index on <literal>y</literal> would serve reasonably well. For
+ would be less useful for queries involving only <literal>y</literal>. Just
+ how useful might depend on how effective the B-Tree index skip scan
+ optimization is; if <literal>x</literal> has no more than several hundred
+ distinct values, skip scan will make searches for specific
+ <literal>y</literal> values execute reasonably efficiently. A combination
+ of a multicolumn index on <literal>(x, y)</literal> and a separate index on
+ <literal>y</literal> might also serve reasonably well. For
queries involving only <literal>x</literal>, the multicolumn index could be
used, though it would be larger and hence slower than an index on
<literal>x</literal> alone. The last alternative is to create all three
diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index 22d8ad1aa..63f03f3a7 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -461,6 +461,13 @@
</entry>
<entry>5</entry>
</row>
+ <row>
+ <entry>
+ Return the addresses of C-callable skip support function(s)
+ (optional)
+ </entry>
+ <entry>6</entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -1056,7 +1063,8 @@ DEFAULT FOR TYPE int8 USING btree FAMILY integer_ops AS
FUNCTION 1 btint8cmp(int8, int8) ,
FUNCTION 2 btint8sortsupport(internal) ,
FUNCTION 3 in_range(int8, int8, int8, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint8skipsupport(internal) ;
CREATE OPERATOR CLASS int4_ops
DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
@@ -1069,7 +1077,8 @@ DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
FUNCTION 1 btint4cmp(int4, int4) ,
FUNCTION 2 btint4sortsupport(internal) ,
FUNCTION 3 in_range(int4, int4, int4, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint4skipsupport(internal) ;
CREATE OPERATOR CLASS int2_ops
DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
@@ -1082,7 +1091,8 @@ DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
FUNCTION 1 btint2cmp(int2, int2) ,
FUNCTION 2 btint2sortsupport(internal) ,
FUNCTION 3 in_range(int2, int2, int2, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint2skipsupport(internal) ;
ALTER OPERATOR FAMILY integer_ops USING btree ADD
-- cross-type comparisons int8 vs int2
diff --git a/src/test/regress/expected/alter_generic.out b/src/test/regress/expected/alter_generic.out
index ae54cb254..8b6b775c1 100644
--- a/src/test/regress/expected/alter_generic.out
+++ b/src/test/regress/expected/alter_generic.out
@@ -362,9 +362,9 @@ ERROR: invalid operator number 0, must be between 1 and 5
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 1 < ; -- operator without argument types
ERROR: operator argument types must be specified in ALTER OPERATOR FAMILY
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 0 btint42cmp(int4, int2); -- invalid options parsing function
-ERROR: invalid function number 0, must be between 1 and 5
-ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 6 btint42cmp(int4, int2); -- function number should be between 1 and 5
-ERROR: invalid function number 6, must be between 1 and 5
+ERROR: invalid function number 0, must be between 1 and 6
+ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 7 btint42cmp(int4, int2); -- function number should be between 1 and 6
+ERROR: invalid function number 7, must be between 1 and 6
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD STORAGE invalid_storage; -- Ensure STORAGE is not a part of ALTER OPERATOR FAMILY
ERROR: STORAGE cannot be specified in ALTER OPERATOR FAMILY
DROP OPERATOR FAMILY alt_opf4 USING btree;
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 3bbe4c5f9..a8d5be6c1 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5138,9 +5138,10 @@ List of access methods
btree | uuid_ops | uuid | uuid | 1 | uuid_cmp
btree | uuid_ops | uuid | uuid | 2 | uuid_sortsupport
btree | uuid_ops | uuid | uuid | 4 | btequalimage
+ btree | uuid_ops | uuid | uuid | 6 | uuid_skipsupport
hash | uuid_ops | uuid | uuid | 1 | uuid_hash
hash | uuid_ops | uuid | uuid | 2 | uuid_hash_extended
-(5 rows)
+(6 rows)
-- check \dconfig
set work_mem = 10240;
diff --git a/src/test/regress/sql/alter_generic.sql b/src/test/regress/sql/alter_generic.sql
index de58d268d..4246afefd 100644
--- a/src/test/regress/sql/alter_generic.sql
+++ b/src/test/regress/sql/alter_generic.sql
@@ -310,7 +310,7 @@ ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 6 < (int4, int2); -- ope
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 0 < (int4, int2); -- operator number should be between 1 and 5
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 1 < ; -- operator without argument types
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 0 btint42cmp(int4, int2); -- invalid options parsing function
-ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 6 btint42cmp(int4, int2); -- function number should be between 1 and 5
+ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 7 btint42cmp(int4, int2); -- function number should be between 1 and 6
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD STORAGE invalid_storage; -- Ensure STORAGE is not a part of ALTER OPERATOR FAMILY
DROP OPERATOR FAMILY alt_opf4 USING btree;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 547d14b3e..3dffb3856 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -218,6 +218,7 @@ BTScanPos
BTScanPosData
BTScanPosItem
BTShared
+BTSkipPreproc
BTSortArrayContext
BTSpool
BTStack
@@ -2660,6 +2661,8 @@ SingleBoundSortItem
SinglePartitionSpec
Size
SkipPages
+SkipSupport
+SkipSupportData
SlabBlock
SlabContext
SlabSlot
--
2.45.2
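As a rough standalone illustration of the uuid_increment routine in the patch above: it treats the 16 UUID bytes as one big-endian counter, bumping the least significant byte and carrying leftwards. The sketch below works on a plain byte buffer instead of a pg_uuid_t Datum; the names sketch_uuid_increment and SKETCH_UUID_LEN are made up for this example and do not appear in the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_UUID_LEN 16		/* hypothetical stand-in for UUID_LEN */

/*
 * Increment a 16-byte value in place, treating it as a big-endian counter.
 *
 * Returns true on success.  Returns false when the input was already the
 * maximum value (all bytes 0xFF); the buffer has then wrapped to all zeroes.
 * This corresponds to the patch's "*overflow = true" case.
 */
bool
sketch_uuid_increment(unsigned char *buf)
{
	for (int i = SKETCH_UUID_LEN - 1; i >= 0; i--)
	{
		if (buf[i] < UCHAR_MAX)
		{
			buf[i]++;
			return true;
		}
		buf[i] = 0;				/* carry into the next more significant byte */
	}

	return false;
}
```

The decrement case is the mirror image: borrow instead of carry, resetting exhausted bytes to UCHAR_MAX.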
Attachment: v5-0002-Refactor-handling-of-nbtree-array-redundancies.patch (application/octet-stream)
From 02143dea1394cf69c0ddd825d35491d776de9b82 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 8 Aug 2024 15:41:18 -0400
Subject: [PATCH v5 2/3] Refactor handling of nbtree array redundancies.
Rather than allocating memory for so.keyData[] at the start of each
btrescan, lazily allocate space later on, in _bt_preprocess_keys. We
now allocate so.keyData[] after _bt_preprocess_array_keys is done
performing initial array related preprocessing.
An immediate benefit of this approach is that _bt_preprocess_array_keys
no longer needs to explicitly mark redundant array scan keys. Other
code (_bt_preprocess_keys and its other subsidiary routines) no longer
has to interpret the scan key entries as redundant. Redundant array
scan keys simply never appear in the _bt_preprocess_keys input array
(_bt_preprocess_array_keys removes them up front).
This refactoring is also preparation for an upcoming patch that will add
skip scan optimizations to nbtree. _bt_preprocess_array_keys will be
taught to add new skip array scan keys to the _bt_preprocess_keys input
array (i.e. to arrayKeyData), so doing things this way avoids uselessly
palloc'ing so.keyData[], only to have to repalloc (to enlarge the array)
almost immediately afterwards. This scheme allows _bt_preprocess_keys
to output a so.keyData[] scan key array that can be larger than the
original scan.keyData[] input array, due to the addition of skip array
scan keys within _bt_preprocess_array_keys.
---
src/backend/access/nbtree/nbtree.c | 10 +-
src/backend/access/nbtree/nbtutils.c | 157 +++++++++++++--------------
2 files changed, 83 insertions(+), 84 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e5ce129cc..964f6c73a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -324,11 +324,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
BTScanPosInvalidate(so->currPos);
BTScanPosInvalidate(so->markPos);
- if (scan->numberOfKeys > 0)
- so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
- else
- so->keyData = NULL;
+ so->keyData = NULL;
so->needPrimScan = false;
so->scanBehind = false;
so->oppoDirCheck = false;
@@ -410,6 +407,11 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
scan->numberOfKeys * sizeof(ScanKeyData));
so->numberOfKeys = 0; /* until _bt_preprocess_keys sets it */
so->numArrayKeys = 0; /* ditto */
+
+ /* Release private storage allocated in previous btrescan, if any */
+ if (so->keyData != NULL)
+ pfree(so->keyData);
+ so->keyData = NULL;
}
/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1b39d8701..7fa977a62 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -62,7 +62,7 @@ static bool _bt_compare_array_scankey_args(IndexScanDesc scan,
ScanKey arraysk, ScanKey skey,
FmgrInfo *orderproc, BTArrayKeyInfo *array,
bool *qual_ok);
-static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan);
+static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys);
static void _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap);
static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
static inline int32 _bt_compare_array_skey(FmgrInfo *orderproc,
@@ -251,9 +251,6 @@ _bt_freestack(BTStack stack)
* It is convenient for _bt_preprocess_keys caller to have to deal with no
* more than one equality strategy array scan key per index attribute. We'll
* always be able to set things up that way when complete opfamilies are used.
- * Eliminated array scan keys can be recognized as those that have had their
- * sk_strategy field set to InvalidStrategy here by us. Caller should avoid
- * including these in the scan's so->keyData[] output array.
*
* We set the scan key references from the scan's BTArrayKeyInfo info array to
* offsets into the temp modified input array returned to caller. Scans that
@@ -261,18 +258,25 @@ _bt_freestack(BTStack stack)
* preprocessing steps are complete. This will convert the scan key offset
* references into references to the scan's so->keyData[] output scan keys.
*
+ * Caller must pass *numberOfKeys to give us a way to change the number of
+ * input scan keys (our output is caller's input). The returned array can be
+ * smaller than scan->keyData[] when we eliminated a redundant array scan key
+ * (redundant with some other array scan key, for the same attribute). Caller
+ * uses this to allocate so->keyData[] for the current btrescan.
+ *
* Note: the reason we need to return a temp scan key array, rather than just
* scribbling on scan->keyData, is that callers are permitted to call btrescan
* without supplying a new set of scankey data.
*/
static ScanKey
-_bt_preprocess_array_keys(IndexScanDesc scan)
+_bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
- int numberOfKeys = scan->numberOfKeys;
+ int numArrayKeyData = scan->numberOfKeys;
int16 *indoption = rel->rd_indoption;
- int numArrayKeys;
+ int numArrayKeys,
+ output_ikey = 0;
int origarrayatt = InvalidAttrNumber,
origarraykey = -1;
Oid origelemtype = InvalidOid;
@@ -280,11 +284,11 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
MemoryContext oldContext;
ScanKey arrayKeyData; /* modified copy of scan->keyData */
- Assert(numberOfKeys);
+ Assert(scan->numberOfKeys);
/* Quick check to see if there are any array keys */
numArrayKeys = 0;
- for (int i = 0; i < numberOfKeys; i++)
+ for (int i = 0; i < scan->numberOfKeys; i++)
{
cur = &scan->keyData[i];
if (cur->sk_flags & SK_SEARCHARRAY)
@@ -317,19 +321,18 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
oldContext = MemoryContextSwitchTo(so->arrayContext);
- /* Create modifiable copy of scan->keyData in the workspace context */
- arrayKeyData = (ScanKey) palloc(numberOfKeys * sizeof(ScanKeyData));
- memcpy(arrayKeyData, scan->keyData, numberOfKeys * sizeof(ScanKeyData));
+ /* Create output scan keys in the workspace context */
+ arrayKeyData = (ScanKey) palloc(numArrayKeyData * sizeof(ScanKeyData));
/* Allocate space for per-array data in the workspace context */
so->arrayKeys = (BTArrayKeyInfo *) palloc(numArrayKeys * sizeof(BTArrayKeyInfo));
/* Allocate space for ORDER procs used to help _bt_checkkeys */
- so->orderProcs = (FmgrInfo *) palloc(numberOfKeys * sizeof(FmgrInfo));
+ so->orderProcs = (FmgrInfo *) palloc(numArrayKeyData * sizeof(FmgrInfo));
/* Now process each array key */
numArrayKeys = 0;
- for (int i = 0; i < numberOfKeys; i++)
+ for (int input_ikey = 0; input_ikey < scan->numberOfKeys; input_ikey++)
{
FmgrInfo sortproc;
FmgrInfo *sortprocp = &sortproc;
@@ -345,14 +348,21 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
int num_nonnulls;
int j;
- cur = &arrayKeyData[i];
+ /*
+ * Copy input scan key into temp arrayKeyData scan key array. (From
+ * here on, cur points at our copy of the input scan key.)
+ */
+ cur = &arrayKeyData[output_ikey];
+ *cur = scan->keyData[input_ikey];
+
if (!(cur->sk_flags & SK_SEARCHARRAY))
+ {
+ output_ikey++; /* keep this non-array scan key */
continue;
+ }
/*
- * First, deconstruct the array into elements. Anything allocated
- * here (including a possibly detoasted array value) is in the
- * workspace context.
+ * Deconstruct the array into elements
*/
arrayval = DatumGetArrayTypeP(cur->sk_argument);
/* We could cache this data, but not clear it's worth it */
@@ -406,6 +416,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
_bt_find_extreme_element(scan, cur, elemtype,
BTGreaterStrategyNumber,
elem_values, num_nonnulls);
+ output_ikey++; /* keep this transformed scan key */
continue;
case BTEqualStrategyNumber:
/* proceed with rest of loop */
@@ -416,6 +427,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
_bt_find_extreme_element(scan, cur, elemtype,
BTLessStrategyNumber,
elem_values, num_nonnulls);
+ output_ikey++; /* keep this transformed scan key */
continue;
default:
elog(ERROR, "unrecognized StrategyNumber: %d",
@@ -432,7 +444,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
* sortproc just points to the same proc used during binary searches.
*/
_bt_setup_array_cmp(scan, cur, elemtype,
- &so->orderProcs[i], &sortprocp);
+ &so->orderProcs[output_ikey], &sortprocp);
/*
* Sort the non-null elements and eliminate any duplicates. We must
@@ -476,11 +488,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
break;
}
- /*
- * Indicate to _bt_preprocess_keys caller that it must ignore
- * this scan key
- */
- cur->sk_strategy = InvalidStrategy;
+ /* Throw away this scan key/array */
continue;
}
@@ -511,12 +519,15 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
* Note: _bt_preprocess_array_keys_final will fix-up each array's
* scan_key field later on, after so->keyData[] has been finalized.
*/
- so->arrayKeys[numArrayKeys].scan_key = i;
+ so->arrayKeys[numArrayKeys].scan_key = output_ikey;
so->arrayKeys[numArrayKeys].num_elems = num_elems;
so->arrayKeys[numArrayKeys].elem_values = elem_values;
numArrayKeys++;
+ output_ikey++; /* keep this scan key/array */
}
+ /* Set final number of arrayKeyData[] keys, array keys */
+ *numberOfKeys = output_ikey;
so->numArrayKeys = numArrayKeys;
MemoryContextSwitchTo(oldContext);
@@ -2429,10 +2440,12 @@ end_toplevel_scan:
/*
* _bt_preprocess_keys() -- Preprocess scan keys
*
+ * The first call here (per btrescan) allocates so->keyData[].
* The given search-type keys (taken from scan->keyData[])
* are copied to so->keyData[] with possible transformation.
* scan->numberOfKeys is the number of input keys, so->numberOfKeys gets
- * the number of output keys (possibly less, never greater).
+ * the number of output keys. Calling here a second or subsequent time
+ * (during the same btrescan) is a no-op.
*
* The output keys are marked with additional sk_flags bits beyond the
* system-standard bits supplied by the caller. The DESC and NULLS_FIRST
@@ -2519,9 +2532,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
int16 *indoption = scan->indexRelation->rd_indoption;
int new_numberOfKeys;
int numberOfEqualCols;
- ScanKey inkeys;
- ScanKey outkeys;
- ScanKey cur;
+ ScanKey inputsk;
BTScanKeyPreproc xform[BTMaxStrategyNumber];
bool test_result;
int i,
@@ -2553,7 +2564,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
return; /* done if qual-less scan */
/* If any keys are SK_SEARCHARRAY type, set up array-key info */
- arrayKeyData = _bt_preprocess_array_keys(scan);
+ arrayKeyData = _bt_preprocess_array_keys(scan, &numberOfKeys);
if (!so->qual_ok)
{
/* unmatchable array, so give up */
@@ -2567,32 +2578,36 @@ _bt_preprocess_keys(IndexScanDesc scan)
*/
if (arrayKeyData)
{
- inkeys = arrayKeyData;
+ inputsk = arrayKeyData;
/* Also maintain keyDataMap for remapping so->orderProc[] later */
keyDataMap = MemoryContextAlloc(so->arrayContext,
numberOfKeys * sizeof(int));
}
else
- inkeys = scan->keyData;
+ inputsk = scan->keyData;
+
+ /*
+ * Now that we have an estimate of the number of output scan keys,
+ * allocate space for them
+ */
+ so->keyData = palloc(sizeof(ScanKeyData) * numberOfKeys);
- outkeys = so->keyData;
- cur = &inkeys[0];
/* we check that input keys are correctly ordered */
- if (cur->sk_attno < 1)
+ if (inputsk[0].sk_attno < 1)
elog(ERROR, "btree index keys must be ordered by attribute");
/* We can short-circuit most of the work if there's just one key */
if (numberOfKeys == 1)
{
/* Apply indoption to scankey (might change sk_strategy!) */
- if (!_bt_fix_scankey_strategy(cur, indoption))
+ if (!_bt_fix_scankey_strategy(inputsk, indoption))
so->qual_ok = false;
- memcpy(outkeys, cur, sizeof(ScanKeyData));
+ memcpy(&so->keyData[0], &inputsk[0], sizeof(ScanKeyData));
so->numberOfKeys = 1;
/* We can mark the qual as required if it's for first index col */
- if (cur->sk_attno == 1)
- _bt_mark_scankey_required(outkeys);
+ if (inputsk[0].sk_attno == 1)
+ _bt_mark_scankey_required(&so->keyData[0]);
if (arrayKeyData)
{
/*
@@ -2600,8 +2615,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
* (we'll miss out on the single value array transformation, but
* that's not nearly as important when there's only one scan key)
*/
- Assert(cur->sk_flags & SK_SEARCHARRAY);
- Assert(cur->sk_strategy != BTEqualStrategyNumber ||
+ Assert(so->keyData[0].sk_flags & SK_SEARCHARRAY);
+ Assert(so->keyData[0].sk_strategy != BTEqualStrategyNumber ||
(so->arrayKeys[0].scan_key == 0 &&
OidIsValid(so->orderProcs[0].fn_oid)));
}
@@ -2629,12 +2644,12 @@ _bt_preprocess_keys(IndexScanDesc scan)
* handle after-last-key processing. Actual exit from the loop is at the
* "break" statement below.
*/
- for (i = 0;; cur++, i++)
+ for (i = 0;; inputsk++, i++)
{
if (i < numberOfKeys)
{
/* Apply indoption to scankey (might change sk_strategy!) */
- if (!_bt_fix_scankey_strategy(cur, indoption))
+ if (!_bt_fix_scankey_strategy(inputsk, indoption))
{
/* NULL can't be matched, so give up */
so->qual_ok = false;
@@ -2646,12 +2661,12 @@ _bt_preprocess_keys(IndexScanDesc scan)
* If we are at the end of the keys for a particular attr, finish up
* processing and emit the cleaned-up keys.
*/
- if (i == numberOfKeys || cur->sk_attno != attno)
+ if (i == numberOfKeys || inputsk->sk_attno != attno)
{
int priorNumberOfEqualCols = numberOfEqualCols;
/* check input keys are correctly ordered */
- if (i < numberOfKeys && cur->sk_attno < attno)
+ if (i < numberOfKeys && inputsk->sk_attno < attno)
elog(ERROR, "btree index keys must be ordered by attribute");
/*
@@ -2755,7 +2770,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
}
/*
- * Emit the cleaned-up keys into the outkeys[] array, and then
+ * Emit the cleaned-up keys into the so->keyData[] array, and then
* mark them if they are required. They are required (possibly
* only in one direction) if all attrs before this one had "=".
*/
@@ -2763,7 +2778,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
{
if (xform[j].skey)
{
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
memcpy(outkey, xform[j].skey, sizeof(ScanKeyData));
if (arrayKeyData)
@@ -2780,19 +2795,19 @@ _bt_preprocess_keys(IndexScanDesc scan)
break;
/* Re-initialize for new attno */
- attno = cur->sk_attno;
+ attno = inputsk->sk_attno;
memset(xform, 0, sizeof(xform));
}
/* check strategy this key's operator corresponds to */
- j = cur->sk_strategy - 1;
+ j = inputsk->sk_strategy - 1;
/* if row comparison, push it directly to the output array */
- if (cur->sk_flags & SK_ROW_HEADER)
+ if (inputsk->sk_flags & SK_ROW_HEADER)
{
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
- memcpy(outkey, cur, sizeof(ScanKeyData));
+ memcpy(outkey, inputsk, sizeof(ScanKeyData));
if (arrayKeyData)
keyDataMap[new_numberOfKeys - 1] = i;
if (numberOfEqualCols == attno - 1)
@@ -2806,21 +2821,10 @@ _bt_preprocess_keys(IndexScanDesc scan)
continue;
}
- /*
- * Does this input scan key require further processing as an array?
- */
- if (cur->sk_strategy == InvalidStrategy)
+ if (inputsk->sk_strategy == BTEqualStrategyNumber &&
+ (inputsk->sk_flags & SK_SEARCHARRAY))
{
- /* _bt_preprocess_array_keys marked this array key redundant */
- Assert(arrayKeyData);
- Assert(cur->sk_flags & SK_SEARCHARRAY);
- continue;
- }
-
- if (cur->sk_strategy == BTEqualStrategyNumber &&
- (cur->sk_flags & SK_SEARCHARRAY))
- {
- /* _bt_preprocess_array_keys kept this array key */
+ /* maintain arrayidx for xform[] array */
Assert(arrayKeyData);
arrayidx++;
}
@@ -2832,7 +2836,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (xform[j].skey == NULL)
{
/* nope, so this scan key wins by default (at least for now) */
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -2850,7 +2854,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
/*
* Have to set up array keys
*/
- if ((cur->sk_flags & SK_SEARCHARRAY))
+ if (inputsk->sk_flags & SK_SEARCHARRAY)
{
array = &so->arrayKeys[arrayidx - 1];
orderproc = so->orderProcs + i;
@@ -2878,7 +2882,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
*/
}
- if (_bt_compare_scankey_args(scan, cur, cur, xform[j].skey,
+ if (_bt_compare_scankey_args(scan, inputsk, inputsk, xform[j].skey,
array, orderproc, &test_result))
{
/* Have all we need to determine redundancy */
@@ -2892,7 +2896,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (j != (BTEqualStrategyNumber - 1) ||
!(xform[j].skey->sk_flags & SK_SEARCHARRAY))
{
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -2905,7 +2909,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
* scan key. _bt_compare_scankey_args expects us to
* always keep arrays (and discard non-arrays).
*/
- Assert(!(cur->sk_flags & SK_SEARCHARRAY));
+ Assert(!(inputsk->sk_flags & SK_SEARCHARRAY));
}
}
else if (j == (BTEqualStrategyNumber - 1))
@@ -2928,14 +2932,14 @@ _bt_preprocess_keys(IndexScanDesc scan)
* even with incomplete opfamilies. _bt_advance_array_keys
* depends on this.
*/
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
memcpy(outkey, xform[j].skey, sizeof(ScanKeyData));
if (arrayKeyData)
keyDataMap[new_numberOfKeys - 1] = xform[j].ikey;
if (numberOfEqualCols == attno - 1)
_bt_mark_scankey_required(outkey);
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -3349,13 +3353,6 @@ _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption)
return true;
}
- if (skey->sk_strategy == InvalidStrategy)
- {
- /* Already-eliminated array scan key; don't need to fix anything */
- Assert(skey->sk_flags & SK_SEARCHARRAY);
- return true;
- }
-
/* Adjust strategy for DESC, if we didn't already */
if ((addflags & SK_BT_DESC) && !(skey->sk_flags & SK_BT_DESC))
skey->sk_strategy = BTCommuteStrategyNumber(skey->sk_strategy);
--
2.45.2
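The core of the refactoring above is a switch from "mark redundant keys and have later code skip them" to "compact the key array up front", using separate input and output cursors. A minimal sketch of that pattern follows. It is not code from the patch: SketchKey and sketch_compact_keys are hypothetical names, and the redundant flag stands in for the decision that _bt_preprocess_array_keys actually makes while merging arrays.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a scan key; only the fields the sketch needs */
typedef struct SketchKey
{
	int			attno;			/* index attribute number */
	bool		redundant;		/* assumed to be decided by earlier logic */
} SketchKey;

/*
 * Copy the non-redundant keys from input[] to output[], preserving order.
 *
 * Returns the number of keys written.  A caller can use that count to size
 * a later allocation, much as the patch sizes so->keyData[] from
 * *numberOfKeys.
 */
int
sketch_compact_keys(const SketchKey *input, int ninput, SketchKey *output)
{
	int			output_ikey = 0;

	for (int input_ikey = 0; input_ikey < ninput; input_ikey++)
	{
		if (input[input_ikey].redundant)
			continue;			/* throw away this key */

		output[output_ikey++] = input[input_ikey];	/* keep this key */
	}

	return output_ikey;
}
```

Because eliminated keys never reach the output array, downstream code needs no special case for them, which is why the InvalidStrategy checks could be deleted.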
On Mon, Jul 15, 2024 at 2:34 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Jul 12, 2024 at 1:19 AM <Masahiro.Ikeda@nttdata.com> wrote:
I found the cost is estimated to be much higher if the number of skipped attributes
is more than two. Is it expected behavior?
Yes and no.
Honestly, the current costing is just placeholder code. It is totally
inadequate. I'm not surprised that you found problems with it. I just
didn't put much work into it, because I didn't really know what to do.
Attached is v6, which finally does something sensible in btcostestimate.
v6 is also the first version that supports parallel index scans that
can skip. This works by extending the approach taken by scans with
regular SAOP arrays to work with skip arrays. We need to serialize and
deserialize the current array keys in shared memory, as datums -- we
cannot just use simple BTArrayKeyInfo.cur_elem offsets with skip
arrays.
v6 also includes the patch that shows "Index Searches" in EXPLAIN
ANALYZE output, just because it's convenient when testing the patch.
This has been independently submitted as
https://commitfest.postgresql.org/49/5183/, so probably doesn't need
review here.
v6 is the first version of the patch that is basically feature
complete. I only have one big open item left: I must still fix certain
regressions seen with queries that are very unfavorable for skip scan,
where the CPU cost (but not I/O cost) of maintaining skip arrays slows
things down. Overall, I'm making fast progress here.
Back to the topic of the btcostestimate/planner changes. The rest of
the email is a discussion of the cost model.
The planner changes probably still have some problems, but all of the
obvious problems have been fixed by v6. I found it useful to focus on
making the cost model not have any obvious problems instead of trying
to make it match a purely theoretical ideal. For example, your
(Ikeda-san's) complaint about the "Index Scan using idx_id1_id2_id3 on
public.test" test case having too high a cost (higher than the cost of
a slower sequential scan) has been fixed. It's now about 3x cheaper
than the sequential scan, since we're actually paying attention to
ndistinct in v6.
Just like when we cost SAOP arrays on HEAD, skip arrays are costed by
pessimistically multiplying together the estimated number of array
elements for all the scan's arrays, without trying to account for
correlation between index columns. Being pessimistic about
correlations like this is often wrong, but that still seems like the
best bias we could have, all things considered. Plus it's nothing new.
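To make the multiplication concrete, here is a minimal sketch. It is not the
btcostestimate code from the patch; the function name and its inputs are
hypothetical, and it assumes that the per-array element estimates have
already been worked out elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the patch): estimate the number of index
 * searches by multiplying together the estimated element counts of every
 * array in the scan (SAOP arrays and skip arrays alike). This treats the
 * index columns as completely uncorrelated, which is the pessimistic bias
 * described above.
 */
double
estimate_index_searches(const double *array_elems, int narrays)
{
	double		num_searches = 1.0;

	assert(narrays >= 0);
	assert(narrays == 0 || array_elems != NULL);

	for (int i = 0; i < narrays; i++)
		num_searches *= array_elems[i];

	return num_searches;
}
```

For example, a skip array on a column with an estimated 50 distinct values
combined with a 3 element IN() list would be costed as 50 * 3 = 150 searches
under this model, even if many of those combinations don't exist in the index.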
Range style skip arrays require a slightly more complicated approach
to estimating the number of array elements: costing applies a
selectivity estimate, taken from the associated index column's
inequality keys, and applies that estimate to ndistinct itself. That
way the cost of a range skip array is lower than an
otherwise-equivalent simple skip array case (we prorate ndistinct with
skip arrays). More importantly, the cost of more selective ranges is
lower than the cost of less selective ranges. There is also a bias
here: we don't account for skew in ndistinct. That's probably OK,
because at least it's a bias *against* skip scan.
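The proration for range skip arrays can be sketched as follows. Again, this
is an illustration rather than the patch's code: the function name is
hypothetical, and the clamp to a minimum of one element is my own assumption,
added so that a very selective range still costs at least one search.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the patch): estimate the number of elements
 * in a range skip array by scaling the column's ndistinct by the selectivity
 * of its inequality keys. Distinct values are assumed to be spread evenly
 * over the column's range, so skew is ignored.
 */
double
range_skip_array_elems(double ndistinct, double ineq_selectivity)
{
	double		elems;

	assert(ndistinct >= 1.0);
	assert(ineq_selectivity >= 0.0 && ineq_selectivity <= 1.0);

	elems = ndistinct * ineq_selectivity;

	/* Assumption: never estimate fewer than one array element */
	return (elems < 1.0) ? 1.0 : elems;
}
```

With ndistinct = 1000, a range with selectivity 0.25 is treated as a 250
element array, while a selectivity of 1.0 gives the same 1000 elements as a
simple skip array. So the more selective range always comes out cheaper.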
The new cost model does not specifically try to account for how scans
will behave when no skipping should be expected at all -- cases where
a so-called "skip scan" degenerates into a full index scan. In theory,
we should be costing these scans the same as before, since there has
been no change in runtime behavior. In practice, though, the cost of full index
scans with very many distinct prefix column values goes down by quite
a bit -- the cost is something like 1/3 lower in typical cases.
The problem with preserving the cost model from HEAD for these
unfavorable cases for skip scan is that I don't feel that I understand
the existing behavior. In practice the revised costing seems to be a
somewhat more accurate predictor of the actual runtime of queries.
Another problem is that I can't see a good way to make the behavior
continuous when ndistinct starts small and grows so large that we
should expect a true full index scan. (As I mentioned at the start of
this email, there are unfixed regressions for these unfavorable cases,
so I'm basing this analysis on the "set skipscan_prefix_cols = 0"
behavior rather than the current default patch behavior to correct for
that. This behavior matches HEAD with a full index scan, and should
match the default behavior in a future version of the skip scan
patch.)
--
Peter Geoghegan
Attachments:
v6-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch (application/octet-stream)
From 20e45ec5a4e9639173be625c4c4b87b86e870397 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 14 Aug 2024 13:50:23 -0400
Subject: [PATCH v6 1/4] Show index search count in EXPLAIN ANALYZE.
Also stop counting the case where nbtree detects contradictory quals as
a distinct index search (do so neither in EXPLAIN ANALYZE nor in the
pg_stat_*_indexes.idx_scan stats).
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com
---
src/include/access/relscan.h | 3 +
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginscan.c | 1 +
src/backend/access/gist/gistget.c | 2 +
src/backend/access/hash/hashsearch.c | 1 +
src/backend/access/index/genam.c | 1 +
src/backend/access/nbtree/nbtree.c | 11 ++++
src/backend/access/nbtree/nbtsearch.c | 9 ++-
src/backend/access/spgist/spgscan.c | 1 +
src/backend/commands/explain.c | 38 +++++++++++++
doc/src/sgml/bloom.sgml | 6 +-
doc/src/sgml/monitoring.sgml | 12 +++-
doc/src/sgml/perform.sgml | 8 +++
doc/src/sgml/ref/explain.sgml | 3 +-
doc/src/sgml/rules.sgml | 1 +
src/test/regress/expected/brin_multi.out | 27 ++++++---
src/test/regress/expected/memoize.out | 50 +++++++++++-----
src/test/regress/expected/partition_prune.out | 57 ++++++++++++++-----
src/test/regress/expected/select.out | 3 +-
src/test/regress/sql/memoize.sql | 6 +-
src/test/regress/sql/partition_prune.sql | 4 ++
21 files changed, 198 insertions(+), 47 deletions(-)
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304..b992d4080 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -130,6 +130,9 @@ typedef struct IndexScanDescData
bool xactStartedInRecovery; /* prevents killing/seeing killed
* tuples */
+ /* index access method instrumentation output state */
+ uint64 nsearches; /* # of index searches */
+
/* index access method's private state */
void *opaque; /* access-method-specific info */
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 6467bed60..749d8b845 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -581,6 +581,7 @@ bringetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
opaque = (BrinOpaque *) scan->opaque;
bdesc = opaque->bo_bdesc;
pgstat_count_index_scan(idxRel);
+ scan->nsearches++;
/*
* We need to know the size of the table so that we know how long to
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index af24d3854..594478116 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -436,6 +436,7 @@ ginNewScanKey(IndexScanDesc scan)
MemoryContextSwitchTo(oldCtx);
pgstat_count_index_scan(scan->indexRelation);
+ scan->nsearches++;
}
void
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index b35b8a975..36f1435cb 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -625,6 +625,7 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
GISTSearchItem fakeItem;
pgstat_count_index_scan(scan->indexRelation);
+ scan->nsearches++;
so->firstCall = false;
so->curPageData = so->nPageData = 0;
@@ -750,6 +751,7 @@ gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
return 0;
pgstat_count_index_scan(scan->indexRelation);
+ scan->nsearches++;
/* Begin the scan by processing the root page */
so->curPageData = so->nPageData = 0;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 0d99d6abc..927ba1039 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -298,6 +298,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
HashScanPosItem *currItem;
pgstat_count_index_scan(rel);
+ scan->nsearches++;
/*
* We do not support hash scans with no index qualification, because we
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 43c95d610..5f4544724 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -116,6 +116,7 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
scan->xactStartedInRecovery = TransactionStartedDuringRecovery();
scan->ignore_killed_tuples = !scan->xactStartedInRecovery;
+ scan->nsearches = 0; /* not reset by index_rescan */
scan->opaque = NULL;
scan->xs_itup = NULL;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 686a3206f..dfef6c12d 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -70,6 +70,7 @@ typedef struct BTParallelScanDescData
BTPS_State btps_pageStatus; /* indicates whether next page is
* available for scan. see above for
* possible states of parallel scan. */
+ uint64 btps_nsearches; /* instrumentation */
slock_t btps_mutex; /* protects above variables, btps_arrElems */
ConditionVariable btps_cv; /* used to synchronize parallel scan */
@@ -551,6 +552,7 @@ btinitparallelscan(void *target)
SpinLockInit(&bt_target->btps_mutex);
bt_target->btps_scanPage = InvalidBlockNumber;
bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+ bt_target->btps_nsearches = 0;
ConditionVariableInit(&bt_target->btps_cv);
}
@@ -576,6 +578,7 @@ btparallelrescan(IndexScanDesc scan)
SpinLockAcquire(&btscan->btps_mutex);
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+ /* deliberately don't reset btps_nsearches (matches index_rescan) */
SpinLockRelease(&btscan->btps_mutex);
}
@@ -680,6 +683,11 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
* We have successfully seized control of the scan for the purpose
* of advancing it to a new page!
*/
+ if (first && btscan->btps_pageStatus == BTPARALLEL_NOT_INITIALIZED)
+ {
+ /* count the first primitive scan for this btrescan */
+ btscan->btps_nsearches++;
+ }
btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
*pageno = btscan->btps_scanPage;
exit_loop = true;
@@ -752,6 +760,8 @@ _bt_parallel_done(IndexScanDesc scan)
btscan->btps_pageStatus = BTPARALLEL_DONE;
status_changed = true;
}
+ /* Copy the authoritative shared primitive scan counter to local field */
+ scan->nsearches = btscan->btps_nsearches;
SpinLockRelease(&btscan->btps_mutex);
/* wake up all the workers associated with this parallel scan */
@@ -785,6 +795,7 @@ _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
{
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NEED_PRIMSCAN;
+ btscan->btps_nsearches++;
/* Serialize scan's current array keys */
for (int i = 0; i < so->numArrayKeys; i++)
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 2551df8a6..4b91a192e 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -896,8 +896,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Assert(!BTScanPosIsValid(so->currPos));
- pgstat_count_index_scan(rel);
-
/*
* Examine the scan keys and eliminate any redundant keys; also mark the
* keys that must be matched to continue the scan.
@@ -960,6 +958,13 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
_bt_start_array_keys(scan, dir);
}
+ /*
+ * We've established that we'll either call _bt_search or _bt_endpoint.
+ * Count this as a primitive index scan/index search.
+ */
+ pgstat_count_index_scan(rel);
+ scan->nsearches++;
+
/*----------
* Examine the scan keys to discover where we need to start the scan.
*
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index 03293a781..9138fc03a 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -423,6 +423,7 @@ spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
/* count an indexscan for stats */
pgstat_count_index_scan(scan->indexRelation);
+ scan->nsearches++;
}
void
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 11df4a04d..31edc8684 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -13,6 +13,7 @@
*/
#include "postgres.h"
+#include "access/relscan.h"
#include "access/xact.h"
#include "catalog/pg_type.h"
#include "commands/createas.h"
@@ -89,6 +90,7 @@ static void show_plan_tlist(PlanState *planstate, List *ancestors,
static void show_expression(Node *node, const char *qlabel,
PlanState *planstate, List *ancestors,
bool useprefix, ExplainState *es);
+static void show_indexscan_nsearches(PlanState *planstate, ExplainState *es);
static void show_qual(List *qual, const char *qlabel,
PlanState *planstate, List *ancestors,
bool useprefix, ExplainState *es);
@@ -1975,6 +1977,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_IndexScan:
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
+ if (es->analyze)
+ show_indexscan_nsearches(planstate, es);
if (((IndexScan *) plan)->indexqualorig)
show_instrumentation_count("Rows Removed by Index Recheck", 2,
planstate, es);
@@ -1988,6 +1992,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_IndexOnlyScan:
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
+ if (es->analyze)
+ show_indexscan_nsearches(planstate, es);
if (((IndexOnlyScan *) plan)->recheckqual)
show_instrumentation_count("Rows Removed by Index Recheck", 2,
planstate, es);
@@ -2004,6 +2010,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
+ if (es->analyze)
+ show_indexscan_nsearches(planstate, es);
break;
case T_BitmapHeapScan:
show_scan_qual(((BitmapHeapScan *) plan)->bitmapqualorig,
@@ -2513,6 +2521,36 @@ show_expression(Node *node, const char *qlabel,
ExplainPropertyText(qlabel, exprstr, es);
}
+/*
+ * Show the number of index searches within an IndexScan node, IndexOnlyScan
+ * node, or BitmapIndexScan node
+ */
+static void
+show_indexscan_nsearches(PlanState *planstate, ExplainState *es)
+{
+ Plan *plan = planstate->plan;
+ struct IndexScanDescData *scanDesc = NULL;
+
+ switch (nodeTag(plan))
+ {
+ case T_IndexScan:
+ scanDesc = ((IndexScanState *) planstate)->iss_ScanDesc;
+ break;
+ case T_IndexOnlyScan:
+ scanDesc = ((IndexOnlyScanState *) planstate)->ioss_ScanDesc;
+ break;
+ case T_BitmapIndexScan:
+ scanDesc = ((BitmapIndexScanState *) planstate)->biss_ScanDesc;
+ break;
+ default:
+ break;
+ }
+
+ if (scanDesc && scanDesc->nsearches > 0)
+ ExplainPropertyUInteger("Index Searches", NULL,
+ scanDesc->nsearches, es);
+}
+
/*
* Show a qualifier expression (which is a List with implicit AND semantics)
*/
diff --git a/doc/src/sgml/bloom.sgml b/doc/src/sgml/bloom.sgml
index 19f2b172c..92b13f539 100644
--- a/doc/src/sgml/bloom.sgml
+++ b/doc/src/sgml/bloom.sgml
@@ -170,9 +170,10 @@ CREATE INDEX
Heap Blocks: exact=28
-> Bitmap Index Scan on bloomidx (cost=0.00..1792.00 rows=2 width=0) (actual time=0.356..0.356 rows=29 loops=1)
Index Cond: ((i2 = 898732) AND (i5 = 123451))
+ Index Searches: 1
Planning Time: 0.099 ms
Execution Time: 0.408 ms
-(8 rows)
+(9 rows)
</programlisting>
</para>
@@ -202,11 +203,12 @@ CREATE INDEX
-> BitmapAnd (cost=24.34..24.34 rows=2 width=0) (actual time=0.027..0.027 rows=0 loops=1)
-> Bitmap Index Scan on btreeidx5 (cost=0.00..12.04 rows=500 width=0) (actual time=0.026..0.026 rows=0 loops=1)
Index Cond: (i5 = 123451)
+ Index Searches: 1
-> Bitmap Index Scan on btreeidx2 (cost=0.00..12.04 rows=500 width=0) (never executed)
Index Cond: (i2 = 898732)
Planning Time: 0.491 ms
Execution Time: 0.055 ms
-(9 rows)
+(10 rows)
</programlisting>
Although this query runs much faster than with either of the single
indexes, we pay a penalty in index size. Each of the single-column
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 55417a6fa..487851994 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4077,12 +4077,18 @@ description | Waiting for a newly initialized WAL file to reach durable storage
Queries that use certain <acronym>SQL</acronym> constructs to search for
rows matching any value out of a list or array of multiple scalar values
(see <xref linkend="functions-comparisons"/>) perform multiple
- <quote>primitive</quote> index scans (up to one primitive scan per scalar
- value) during query execution. Each internal primitive index scan
- increments <structname>pg_stat_all_indexes</structname>.<structfield>idx_scan</structfield>,
+ index searches (up to one index search per scalar value) during query
+ execution. Each internal index search increments
+ <structname>pg_stat_all_indexes</structname>.<structfield>idx_scan</structfield>,
so it's possible for the count of index scans to significantly exceed the
total number of index scan executor node executions.
</para>
+ <para>
+ <command>EXPLAIN ANALYZE</command> breaks down the total number of index
+ searches performed by each index scan node. <literal>Index Searches: N</literal>
+ indicates the total number of searches across <emphasis>all</emphasis>
+ executor node executions/loops.
+ </para>
</note>
</sect2>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index ff689b652..1f2172960 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -702,8 +702,10 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
Heap Blocks: exact=10
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36 rows=10 width=0) (actual time=0.004..0.004 rows=10 loops=1)
Index Cond: (unique1 < 10)
+ Index Searches: 1
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1 loops=10)
Index Cond: (unique2 = t1.unique2)
+ Index Searches: 1
Planning Time: 0.485 ms
Execution Time: 0.073 ms
</screen>
@@ -754,6 +756,7 @@ WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2 ORDER BY t1.fivethous;
Heap Blocks: exact=90
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04 rows=100 width=0) (actual time=0.013..0.013 rows=100 loops=1)
Index Cond: (unique1 < 100)
+ Index Searches: 1
Planning Time: 0.187 ms
Execution Time: 3.036 ms
</screen>
@@ -819,6 +822,7 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
-------------------------------------------------------------------&zwsp;-------------------------------------------------------
Index Scan using gpolygonind on polygon_tbl (cost=0.13..8.15 rows=1 width=85) (actual time=0.074..0.074 rows=0 loops=1)
Index Cond: (f1 @> '((0.5,2))'::polygon)
+ Index Searches: 1
Rows Removed by Index Recheck: 1
Planning Time: 0.039 ms
Execution Time: 0.098 ms
@@ -848,9 +852,11 @@ EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tenk1 WHERE unique1 < 100 AND unique
Buffers: shared hit=4 read=3
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04 rows=100 width=0) (actual time=0.027..0.027 rows=100 loops=1)
Index Cond: (unique1 < 100)
+ Index Searches: 1
Buffers: shared hit=2
-> Bitmap Index Scan on tenk1_unique2 (cost=0.00..19.78 rows=999 width=0) (actual time=0.070..0.070 rows=999 loops=1)
Index Cond: (unique2 > 9000)
+ Index Searches: 1
Buffers: shared hit=2 read=3
Planning:
Buffers: shared hit=3
@@ -883,6 +889,7 @@ EXPLAIN ANALYZE UPDATE tenk1 SET hundred = hundred + 1 WHERE unique1 < 100;
Heap Blocks: exact=90
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04 rows=100 width=0) (actual time=0.031..0.031 rows=100 loops=1)
Index Cond: (unique1 < 100)
+ Index Searches: 1
Planning Time: 0.151 ms
Execution Time: 1.856 ms
@@ -1017,6 +1024,7 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000
Limit (cost=0.29..14.33 rows=2 width=244) (actual time=0.051..0.071 rows=2 loops=1)
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..70.50 rows=10 width=244) (actual time=0.051..0.070 rows=2 loops=1)
Index Cond: (unique2 > 9000)
+ Index Searches: 1
Filter: (unique1 < 100)
Rows Removed by Filter: 287
Planning Time: 0.077 ms
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index db9d3a854..e042638b7 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -502,9 +502,10 @@ EXPLAIN ANALYZE EXECUTE query(100, 200);
Batches: 1 Memory Usage: 24kB
-> Index Scan using test_pkey on test (cost=0.29..10.27 rows=99 width=8) (actual time=0.009..0.025 rows=99 loops=1)
Index Cond: ((id > 100) AND (id < 200))
+ Index Searches: 1
Planning Time: 0.244 ms
Execution Time: 0.073 ms
-(7 rows)
+(8 rows)
</programlisting>
</para>
diff --git a/doc/src/sgml/rules.sgml b/doc/src/sgml/rules.sgml
index 7a928bd7b..7a00e4c0e 100644
--- a/doc/src/sgml/rules.sgml
+++ b/doc/src/sgml/rules.sgml
@@ -1045,6 +1045,7 @@ SELECT count(*) FROM words WHERE word = 'caterpiler';
Aggregate (cost=4.44..4.45 rows=1 width=0) (actual time=0.042..0.042 rows=1 loops=1)
-> Index Only Scan using wrd_word on wrd (cost=0.42..4.44 rows=1 width=0) (actual time=0.039..0.039 rows=0 loops=1)
Index Cond: (word = 'caterpiler'::text)
+ Index Searches: 1
Heap Fetches: 0
Planning time: 0.164 ms
Execution time: 0.117 ms
diff --git a/src/test/regress/expected/brin_multi.out b/src/test/regress/expected/brin_multi.out
index ae9ce9d8e..c24d56007 100644
--- a/src/test/regress/expected/brin_multi.out
+++ b/src/test/regress/expected/brin_multi.out
@@ -853,7 +853,8 @@ SELECT * FROM brin_date_test WHERE a = '2023-01-01'::date;
Recheck Cond: (a = '2023-01-01'::date)
-> Bitmap Index Scan on brin_date_test_a_idx (actual rows=0 loops=1)
Index Cond: (a = '2023-01-01'::date)
-(4 rows)
+ Index Searches: 1
+(5 rows)
DROP TABLE brin_date_test;
RESET enable_seqscan;
@@ -872,7 +873,8 @@ SELECT * FROM brin_timestamp_test WHERE a = '2023-01-01'::timestamp;
Recheck Cond: (a = '2023-01-01 00:00:00'::timestamp without time zone)
-> Bitmap Index Scan on brin_timestamp_test_a_idx (actual rows=0 loops=1)
Index Cond: (a = '2023-01-01 00:00:00'::timestamp without time zone)
-(4 rows)
+ Index Searches: 1
+(5 rows)
EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF, SUMMARY OFF)
SELECT * FROM brin_timestamp_test WHERE a = '1900-01-01'::timestamp;
@@ -882,7 +884,8 @@ SELECT * FROM brin_timestamp_test WHERE a = '1900-01-01'::timestamp;
Recheck Cond: (a = '1900-01-01 00:00:00'::timestamp without time zone)
-> Bitmap Index Scan on brin_timestamp_test_a_idx (actual rows=0 loops=1)
Index Cond: (a = '1900-01-01 00:00:00'::timestamp without time zone)
-(4 rows)
+ Index Searches: 1
+(5 rows)
DROP TABLE brin_timestamp_test;
RESET enable_seqscan;
@@ -900,7 +903,8 @@ SELECT * FROM brin_date_test WHERE a = '2023-01-01'::date;
Recheck Cond: (a = '2023-01-01'::date)
-> Bitmap Index Scan on brin_date_test_a_idx (actual rows=0 loops=1)
Index Cond: (a = '2023-01-01'::date)
-(4 rows)
+ Index Searches: 1
+(5 rows)
EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF, SUMMARY OFF)
SELECT * FROM brin_date_test WHERE a = '1900-01-01'::date;
@@ -910,7 +914,8 @@ SELECT * FROM brin_date_test WHERE a = '1900-01-01'::date;
Recheck Cond: (a = '1900-01-01'::date)
-> Bitmap Index Scan on brin_date_test_a_idx (actual rows=0 loops=1)
Index Cond: (a = '1900-01-01'::date)
-(4 rows)
+ Index Searches: 1
+(5 rows)
DROP TABLE brin_date_test;
RESET enable_seqscan;
@@ -929,7 +934,8 @@ SELECT * FROM brin_interval_test WHERE a = '-30 years'::interval;
Recheck Cond: (a = '@ 30 years ago'::interval)
-> Bitmap Index Scan on brin_interval_test_a_idx (actual rows=0 loops=1)
Index Cond: (a = '@ 30 years ago'::interval)
-(4 rows)
+ Index Searches: 1
+(5 rows)
EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF, SUMMARY OFF)
SELECT * FROM brin_interval_test WHERE a = '30 years'::interval;
@@ -939,7 +945,8 @@ SELECT * FROM brin_interval_test WHERE a = '30 years'::interval;
Recheck Cond: (a = '@ 30 years'::interval)
-> Bitmap Index Scan on brin_interval_test_a_idx (actual rows=0 loops=1)
Index Cond: (a = '@ 30 years'::interval)
-(4 rows)
+ Index Searches: 1
+(5 rows)
DROP TABLE brin_interval_test;
RESET enable_seqscan;
@@ -957,7 +964,8 @@ SELECT * FROM brin_interval_test WHERE a = '-30 years'::interval;
Recheck Cond: (a = '@ 30 years ago'::interval)
-> Bitmap Index Scan on brin_interval_test_a_idx (actual rows=0 loops=1)
Index Cond: (a = '@ 30 years ago'::interval)
-(4 rows)
+ Index Searches: 1
+(5 rows)
EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF, SUMMARY OFF)
SELECT * FROM brin_interval_test WHERE a = '30 years'::interval;
@@ -967,7 +975,8 @@ SELECT * FROM brin_interval_test WHERE a = '30 years'::interval;
Recheck Cond: (a = '@ 30 years'::interval)
-> Bitmap Index Scan on brin_interval_test_a_idx (actual rows=0 loops=1)
Index Cond: (a = '@ 30 years'::interval)
-(4 rows)
+ Index Searches: 1
+(5 rows)
DROP TABLE brin_interval_test;
RESET enable_seqscan;
diff --git a/src/test/regress/expected/memoize.out b/src/test/regress/expected/memoize.out
index 9ee09fe2f..1448179fb 100644
--- a/src/test/regress/expected/memoize.out
+++ b/src/test/regress/expected/memoize.out
@@ -22,8 +22,10 @@ begin
ln := regexp_replace(ln, 'Evictions: 0', 'Evictions: Zero');
ln := regexp_replace(ln, 'Evictions: \d+', 'Evictions: N');
ln := regexp_replace(ln, 'Memory Usage: \d+', 'Memory Usage: N');
- ln := regexp_replace(ln, 'Heap Fetches: \d+', 'Heap Fetches: N');
- ln := regexp_replace(ln, 'loops=\d+', 'loops=N');
+ ln := regexp_replace(ln, 'Heap Fetches: \d+', 'Heap Fetches: N');
+ ln := regexp_replace(ln, 'loops=\d+', 'loops=N');
+ ln := regexp_replace(ln, 'Index Searches: 0', 'Index Searches: Zero');
+ ln := regexp_replace(ln, 'Index Searches: \d+', 'Index Searches: N');
return next ln;
end loop;
end;
@@ -48,8 +50,9 @@ WHERE t2.unique1 < 1000;', false);
Hits: 980 Misses: 20 Evictions: Zero Overflows: 0 Memory Usage: NkB
-> Index Only Scan using tenk1_unique1 on tenk1 t1 (actual rows=1 loops=N)
Index Cond: (unique1 = t2.twenty)
+ Index Searches: N
Heap Fetches: N
-(12 rows)
+(13 rows)
-- And check we get the expected results.
SELECT COUNT(*),AVG(t1.unique1) FROM tenk1 t1
@@ -79,8 +82,9 @@ WHERE t1.unique1 < 1000;', false);
Hits: 980 Misses: 20 Evictions: Zero Overflows: 0 Memory Usage: NkB
-> Index Only Scan using tenk1_unique1 on tenk1 t2 (actual rows=1 loops=N)
Index Cond: (unique1 = t1.twenty)
+ Index Searches: N
Heap Fetches: N
-(12 rows)
+(13 rows)
-- And check we get the expected results.
SELECT COUNT(*),AVG(t2.unique1) FROM tenk1 t1,
@@ -106,6 +110,7 @@ WHERE t1.unique1 < 10;', false);
-> Nested Loop Left Join (actual rows=20 loops=N)
-> Index Scan using tenk1_unique1 on tenk1 t1 (actual rows=10 loops=N)
Index Cond: (unique1 < 10)
+ Index Searches: N
-> Memoize (actual rows=2 loops=N)
Cache Key: t1.two
Cache Mode: binary
@@ -115,7 +120,8 @@ WHERE t1.unique1 < 10;', false);
Rows Removed by Filter: 2
-> Index Scan using tenk1_unique1 on tenk1 t2_1 (actual rows=4 loops=N)
Index Cond: (unique1 < 4)
-(13 rows)
+ Index Searches: N
+(15 rows)
-- And check we get the expected results.
SELECT COUNT(*),AVG(t2.t1two) FROM tenk1 t1 LEFT JOIN
@@ -146,10 +152,11 @@ WHERE s.c1 = s.c2 AND t1.unique1 < 1000;', false);
Cache Mode: binary
Hits: 998 Misses: 2 Evictions: Zero Overflows: 0 Memory Usage: NkB
-> Index Only Scan using tenk1_unique1 on tenk1 t2 (actual rows=1 loops=N)
+ Index Searches: N
Filter: ((t1.two + 1) = unique1)
Rows Removed by Filter: 9999
Heap Fetches: N
-(13 rows)
+(14 rows)
-- And check we get the expected results.
SELECT COUNT(*), AVG(t1.twenty) FROM tenk1 t1 LEFT JOIN
@@ -217,9 +224,10 @@ ON t1.x = t2.t::numeric AND t1.t::numeric = t2.x;', false);
Hits: 20 Misses: 20 Evictions: Zero Overflows: 0 Memory Usage: NkB
-> Index Only Scan using expr_key_idx_x_t on expr_key t2 (actual rows=2 loops=N)
Index Cond: (x = (t1.t)::numeric)
+ Index Searches: N
Filter: (t1.x = (t)::numeric)
Heap Fetches: N
-(10 rows)
+(11 rows)
DROP TABLE expr_key;
-- Reduce work_mem and hash_mem_multiplier so that we see some cache evictions
@@ -245,8 +253,9 @@ WHERE t2.unique1 < 1200;', true);
Hits: N Misses: N Evictions: N Overflows: 0 Memory Usage: NkB
-> Index Only Scan using tenk1_unique1 on tenk1 t1 (actual rows=1 loops=N)
Index Cond: (unique1 = t2.thousand)
+ Index Searches: N
Heap Fetches: N
-(12 rows)
+(13 rows)
CREATE TABLE flt (f float);
CREATE INDEX flt_f_idx ON flt (f);
@@ -260,6 +269,7 @@ SELECT * FROM flt f1 INNER JOIN flt f2 ON f1.f = f2.f;', false);
-------------------------------------------------------------------------------
Nested Loop (actual rows=4 loops=N)
-> Index Only Scan using flt_f_idx on flt f1 (actual rows=2 loops=N)
+ Index Searches: N
Heap Fetches: N
-> Memoize (actual rows=2 loops=N)
Cache Key: f1.f
@@ -267,8 +277,9 @@ SELECT * FROM flt f1 INNER JOIN flt f2 ON f1.f = f2.f;', false);
Hits: 1 Misses: 1 Evictions: Zero Overflows: 0 Memory Usage: NkB
-> Index Only Scan using flt_f_idx on flt f2 (actual rows=2 loops=N)
Index Cond: (f = f1.f)
+ Index Searches: N
Heap Fetches: N
-(10 rows)
+(12 rows)
-- Ensure memoize operates in binary mode
SELECT explain_memoize('
@@ -277,6 +288,7 @@ SELECT * FROM flt f1 INNER JOIN flt f2 ON f1.f >= f2.f;', false);
-------------------------------------------------------------------------------
Nested Loop (actual rows=4 loops=N)
-> Index Only Scan using flt_f_idx on flt f1 (actual rows=2 loops=N)
+ Index Searches: N
Heap Fetches: N
-> Memoize (actual rows=2 loops=N)
Cache Key: f1.f
@@ -284,8 +296,9 @@ SELECT * FROM flt f1 INNER JOIN flt f2 ON f1.f >= f2.f;', false);
Hits: 0 Misses: 2 Evictions: Zero Overflows: 0 Memory Usage: NkB
-> Index Only Scan using flt_f_idx on flt f2 (actual rows=2 loops=N)
Index Cond: (f <= f1.f)
+ Index Searches: N
Heap Fetches: N
-(10 rows)
+(12 rows)
DROP TABLE flt;
-- Exercise Memoize in binary mode with a large fixed width type and a
@@ -312,7 +325,8 @@ SELECT * FROM strtest s1 INNER JOIN strtest s2 ON s1.n >= s2.n;', false);
Hits: 3 Misses: 3 Evictions: Zero Overflows: 0 Memory Usage: NkB
-> Index Scan using strtest_n_idx on strtest s2 (actual rows=4 loops=N)
Index Cond: (n <= s1.n)
-(10 rows)
+ Index Searches: N
+(11 rows)
-- Ensure we get 3 hits and 3 misses
SELECT explain_memoize('
@@ -329,7 +343,8 @@ SELECT * FROM strtest s1 INNER JOIN strtest s2 ON s1.t >= s2.t;', false);
Hits: 3 Misses: 3 Evictions: Zero Overflows: 0 Memory Usage: NkB
-> Index Scan using strtest_t_idx on strtest s2 (actual rows=4 loops=N)
Index Cond: (t <= s1.t)
-(10 rows)
+ Index Searches: N
+(11 rows)
DROP TABLE strtest;
-- Ensure memoize works with partitionwise join
@@ -349,6 +364,7 @@ SELECT * FROM prt t1 INNER JOIN prt t2 ON t1.a = t2.a;', false);
Append (actual rows=32 loops=N)
-> Nested Loop (actual rows=16 loops=N)
-> Index Only Scan using iprt_p1_a on prt_p1 t1_1 (actual rows=4 loops=N)
+ Index Searches: N
Heap Fetches: N
-> Memoize (actual rows=4 loops=N)
Cache Key: t1_1.a
@@ -356,9 +372,11 @@ SELECT * FROM prt t1 INNER JOIN prt t2 ON t1.a = t2.a;', false);
Hits: 3 Misses: 1 Evictions: Zero Overflows: 0 Memory Usage: NkB
-> Index Only Scan using iprt_p1_a on prt_p1 t2_1 (actual rows=4 loops=N)
Index Cond: (a = t1_1.a)
+ Index Searches: N
Heap Fetches: N
-> Nested Loop (actual rows=16 loops=N)
-> Index Only Scan using iprt_p2_a on prt_p2 t1_2 (actual rows=4 loops=N)
+ Index Searches: N
Heap Fetches: N
-> Memoize (actual rows=4 loops=N)
Cache Key: t1_2.a
@@ -366,8 +384,9 @@ SELECT * FROM prt t1 INNER JOIN prt t2 ON t1.a = t2.a;', false);
Hits: 3 Misses: 1 Evictions: Zero Overflows: 0 Memory Usage: NkB
-> Index Only Scan using iprt_p2_a on prt_p2 t2_2 (actual rows=4 loops=N)
Index Cond: (a = t1_2.a)
+ Index Searches: N
Heap Fetches: N
-(21 rows)
+(25 rows)
-- Ensure memoize works with parameterized union-all Append path
SET enable_partitionwise_join TO off;
@@ -379,6 +398,7 @@ ON t1.a = t2.a;', false);
-------------------------------------------------------------------------------------
Nested Loop (actual rows=16 loops=N)
-> Index Only Scan using iprt_p1_a on prt_p1 t1 (actual rows=4 loops=N)
+ Index Searches: N
Heap Fetches: N
-> Memoize (actual rows=4 loops=N)
Cache Key: t1.a
@@ -387,11 +407,13 @@ ON t1.a = t2.a;', false);
-> Append (actual rows=4 loops=N)
-> Index Only Scan using iprt_p1_a on prt_p1 (actual rows=4 loops=N)
Index Cond: (a = t1.a)
+ Index Searches: N
Heap Fetches: N
-> Index Only Scan using iprt_p2_a on prt_p2 (actual rows=0 loops=N)
Index Cond: (a = t1.a)
+ Index Searches: N
Heap Fetches: N
-(14 rows)
+(17 rows)
DROP TABLE prt;
RESET enable_partitionwise_join;
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 7a03b4e36..18ea272b6 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -2340,6 +2340,10 @@ begin
ln := regexp_replace(ln, 'Workers Launched: \d+', 'Workers Launched: N');
ln := regexp_replace(ln, 'actual rows=\d+ loops=\d+', 'actual rows=N loops=N');
ln := regexp_replace(ln, 'Rows Removed by Filter: \d+', 'Rows Removed by Filter: N');
+ perform regexp_matches(ln, 'Index Searches: \d+');
+ if found then
+ continue;
+ end if;
return next ln;
end loop;
end;
@@ -2692,12 +2696,13 @@ select * from ab where a = (select max(a) from lprt_a) and b = (select max(a)-1
Filter: (b = (InitPlan 2).col1)
-> Bitmap Index Scan on ab_a3_b2_a_idx (actual rows=0 loops=1)
Index Cond: (a = (InitPlan 1).col1)
+ Index Searches: 1
-> Bitmap Heap Scan on ab_a3_b3 ab_9 (never executed)
Recheck Cond: (a = (InitPlan 1).col1)
Filter: (b = (InitPlan 2).col1)
-> Bitmap Index Scan on ab_a3_b3_a_idx (never executed)
Index Cond: (a = (InitPlan 1).col1)
-(52 rows)
+(53 rows)
-- Test run-time partition pruning with UNION ALL parents
explain (analyze, costs off, summary off, timing off)
@@ -2713,6 +2718,7 @@ select * from (select * from ab where a = 1 union all select * from ab) ab where
Filter: (b = (InitPlan 1).col1)
-> Bitmap Index Scan on ab_a1_b1_a_idx (actual rows=0 loops=1)
Index Cond: (a = 1)
+ Index Searches: 1
-> Bitmap Heap Scan on ab_a1_b2 ab_12 (never executed)
Recheck Cond: (a = 1)
Filter: (b = (InitPlan 1).col1)
@@ -2741,7 +2747,7 @@ select * from (select * from ab where a = 1 union all select * from ab) ab where
Filter: (b = (InitPlan 1).col1)
-> Seq Scan on ab_a3_b3 ab_9 (never executed)
Filter: (b = (InitPlan 1).col1)
-(37 rows)
+(38 rows)
-- A case containing a UNION ALL with a non-partitioned child.
explain (analyze, costs off, summary off, timing off)
@@ -2757,6 +2763,7 @@ select * from (select * from ab where a = 1 union all (values(10,5)) union all s
Filter: (b = (InitPlan 1).col1)
-> Bitmap Index Scan on ab_a1_b1_a_idx (actual rows=0 loops=1)
Index Cond: (a = 1)
+ Index Searches: 1
-> Bitmap Heap Scan on ab_a1_b2 ab_12 (never executed)
Recheck Cond: (a = 1)
Filter: (b = (InitPlan 1).col1)
@@ -2787,7 +2794,7 @@ select * from (select * from ab where a = 1 union all (values(10,5)) union all s
Filter: (b = (InitPlan 1).col1)
-> Seq Scan on ab_a3_b3 ab_9 (never executed)
Filter: (b = (InitPlan 1).col1)
-(39 rows)
+(40 rows)
-- Another UNION ALL test, but containing a mix of exec init and exec run-time pruning.
create table xy_1 (x int, y int);
@@ -2858,16 +2865,19 @@ update ab_a1 set b = 3 from ab where ab.a = 1 and ab.a = ab_a1.a;');
Recheck Cond: (a = 1)
-> Bitmap Index Scan on ab_a1_b1_a_idx (actual rows=0 loops=1)
Index Cond: (a = 1)
+ Index Searches: 1
-> Bitmap Heap Scan on ab_a1_b2 ab_a1_2 (actual rows=1 loops=1)
Recheck Cond: (a = 1)
Heap Blocks: exact=1
-> Bitmap Index Scan on ab_a1_b2_a_idx (actual rows=1 loops=1)
Index Cond: (a = 1)
+ Index Searches: 1
-> Bitmap Heap Scan on ab_a1_b3 ab_a1_3 (actual rows=0 loops=1)
Recheck Cond: (a = 1)
Heap Blocks: exact=1
-> Bitmap Index Scan on ab_a1_b3_a_idx (actual rows=1 loops=1)
Index Cond: (a = 1)
+ Index Searches: 1
-> Materialize (actual rows=1 loops=1)
Storage: Memory Maximum Storage: NkB
-> Append (actual rows=1 loops=1)
@@ -2875,17 +2885,20 @@ update ab_a1 set b = 3 from ab where ab.a = 1 and ab.a = ab_a1.a;');
Recheck Cond: (a = 1)
-> Bitmap Index Scan on ab_a1_b1_a_idx (actual rows=0 loops=1)
Index Cond: (a = 1)
+ Index Searches: 1
-> Bitmap Heap Scan on ab_a1_b2 ab_2 (actual rows=1 loops=1)
Recheck Cond: (a = 1)
Heap Blocks: exact=1
-> Bitmap Index Scan on ab_a1_b2_a_idx (actual rows=1 loops=1)
Index Cond: (a = 1)
+ Index Searches: 1
-> Bitmap Heap Scan on ab_a1_b3 ab_3 (actual rows=0 loops=1)
Recheck Cond: (a = 1)
Heap Blocks: exact=1
-> Bitmap Index Scan on ab_a1_b3_a_idx (actual rows=1 loops=1)
Index Cond: (a = 1)
-(37 rows)
+ Index Searches: 1
+(43 rows)
table ab;
a | b
@@ -2961,8 +2974,10 @@ select * from tbl1 join tprt on tbl1.col1 > tprt.col1;
-> Append (actual rows=3 loops=2)
-> Index Scan using tprt1_idx on tprt_1 (actual rows=2 loops=2)
Index Cond: (col1 < tbl1.col1)
+ Index Searches: 2
-> Index Scan using tprt2_idx on tprt_2 (actual rows=2 loops=1)
Index Cond: (col1 < tbl1.col1)
+ Index Searches: 1
-> Index Scan using tprt3_idx on tprt_3 (never executed)
Index Cond: (col1 < tbl1.col1)
-> Index Scan using tprt4_idx on tprt_4 (never executed)
@@ -2971,7 +2986,7 @@ select * from tbl1 join tprt on tbl1.col1 > tprt.col1;
Index Cond: (col1 < tbl1.col1)
-> Index Scan using tprt6_idx on tprt_6 (never executed)
Index Cond: (col1 < tbl1.col1)
-(15 rows)
+(17 rows)
explain (analyze, costs off, summary off, timing off)
select * from tbl1 join tprt on tbl1.col1 = tprt.col1;
@@ -2984,6 +2999,7 @@ select * from tbl1 join tprt on tbl1.col1 = tprt.col1;
Index Cond: (col1 = tbl1.col1)
-> Index Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
Index Cond: (col1 = tbl1.col1)
+ Index Searches: 2
-> Index Scan using tprt3_idx on tprt_3 (never executed)
Index Cond: (col1 = tbl1.col1)
-> Index Scan using tprt4_idx on tprt_4 (never executed)
@@ -2992,7 +3008,7 @@ select * from tbl1 join tprt on tbl1.col1 = tprt.col1;
Index Cond: (col1 = tbl1.col1)
-> Index Scan using tprt6_idx on tprt_6 (never executed)
Index Cond: (col1 = tbl1.col1)
-(15 rows)
+(16 rows)
select tbl1.col1, tprt.col1 from tbl1
inner join tprt on tbl1.col1 > tprt.col1
@@ -3027,17 +3043,20 @@ select * from tbl1 inner join tprt on tbl1.col1 > tprt.col1;
-> Append (actual rows=5 loops=5)
-> Index Scan using tprt1_idx on tprt_1 (actual rows=2 loops=5)
Index Cond: (col1 < tbl1.col1)
+ Index Searches: 5
-> Index Scan using tprt2_idx on tprt_2 (actual rows=3 loops=4)
Index Cond: (col1 < tbl1.col1)
+ Index Searches: 4
-> Index Scan using tprt3_idx on tprt_3 (actual rows=1 loops=2)
Index Cond: (col1 < tbl1.col1)
+ Index Searches: 2
-> Index Scan using tprt4_idx on tprt_4 (never executed)
Index Cond: (col1 < tbl1.col1)
-> Index Scan using tprt5_idx on tprt_5 (never executed)
Index Cond: (col1 < tbl1.col1)
-> Index Scan using tprt6_idx on tprt_6 (never executed)
Index Cond: (col1 < tbl1.col1)
-(15 rows)
+(18 rows)
explain (analyze, costs off, summary off, timing off)
select * from tbl1 inner join tprt on tbl1.col1 = tprt.col1;
@@ -3050,15 +3069,17 @@ select * from tbl1 inner join tprt on tbl1.col1 = tprt.col1;
Index Cond: (col1 = tbl1.col1)
-> Index Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
Index Cond: (col1 = tbl1.col1)
+ Index Searches: 2
-> Index Scan using tprt3_idx on tprt_3 (actual rows=0 loops=3)
Index Cond: (col1 = tbl1.col1)
+ Index Searches: 3
-> Index Scan using tprt4_idx on tprt_4 (never executed)
Index Cond: (col1 = tbl1.col1)
-> Index Scan using tprt5_idx on tprt_5 (never executed)
Index Cond: (col1 = tbl1.col1)
-> Index Scan using tprt6_idx on tprt_6 (never executed)
Index Cond: (col1 = tbl1.col1)
-(15 rows)
+(17 rows)
select tbl1.col1, tprt.col1 from tbl1
inner join tprt on tbl1.col1 > tprt.col1
@@ -3122,7 +3143,8 @@ select * from tbl1 join tprt on tbl1.col1 < tprt.col1;
Index Cond: (col1 > tbl1.col1)
-> Index Scan using tprt6_idx on tprt_6 (actual rows=1 loops=1)
Index Cond: (col1 > tbl1.col1)
-(15 rows)
+ Index Searches: 1
+(16 rows)
select tbl1.col1, tprt.col1 from tbl1
inner join tprt on tbl1.col1 < tprt.col1
@@ -3482,12 +3504,14 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
Sort Key: ma_test.b
Subplans Removed: 1
-> Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_1 (actual rows=1 loops=1)
+ Index Searches: 1
Filter: ((a >= $1) AND ((a % 10) = 5))
Rows Removed by Filter: 9
-> Index Scan using ma_test_p3_b_idx on ma_test_p3 ma_test_2 (actual rows=1 loops=1)
+ Index Searches: 1
Filter: ((a >= $1) AND ((a % 10) = 5))
Rows Removed by Filter: 9
-(9 rows)
+(11 rows)
execute mt_q1(15);
a
@@ -3503,9 +3527,10 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
Sort Key: ma_test.b
Subplans Removed: 2
-> Index Scan using ma_test_p3_b_idx on ma_test_p3 ma_test_1 (actual rows=1 loops=1)
+ Index Searches: 1
Filter: ((a >= $1) AND ((a % 10) = 5))
Rows Removed by Filter: 9
-(6 rows)
+(7 rows)
execute mt_q1(25);
a
@@ -3553,13 +3578,16 @@ explain (analyze, costs off, summary off, timing off) select * from ma_test wher
-> Limit (actual rows=1 loops=1)
-> Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
Index Cond: (b IS NOT NULL)
+ Index Searches: 1
-> Index Scan using ma_test_p1_b_idx on ma_test_p1 ma_test_1 (never executed)
Filter: (a >= (InitPlan 2).col1)
-> Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_2 (actual rows=10 loops=1)
+ Index Searches: 1
Filter: (a >= (InitPlan 2).col1)
-> Index Scan using ma_test_p3_b_idx on ma_test_p3 ma_test_3 (actual rows=10 loops=1)
+ Index Searches: 1
Filter: (a >= (InitPlan 2).col1)
-(14 rows)
+(17 rows)
reset enable_seqscan;
reset enable_sort;
@@ -4129,14 +4157,17 @@ select * from rangep where b IN((select 1),(select 2)) order by a;
-> Merge Append (actual rows=0 loops=1)
Sort Key: rangep_2.a
-> Index Scan using rangep_0_to_100_1_a_idx on rangep_0_to_100_1 rangep_2 (actual rows=0 loops=1)
+ Index Searches: 1
Filter: (b = ANY (ARRAY[(InitPlan 1).col1, (InitPlan 2).col1]))
-> Index Scan using rangep_0_to_100_2_a_idx on rangep_0_to_100_2 rangep_3 (actual rows=0 loops=1)
+ Index Searches: 1
Filter: (b = ANY (ARRAY[(InitPlan 1).col1, (InitPlan 2).col1]))
-> Index Scan using rangep_0_to_100_3_a_idx on rangep_0_to_100_3 rangep_4 (never executed)
Filter: (b = ANY (ARRAY[(InitPlan 1).col1, (InitPlan 2).col1]))
-> Index Scan using rangep_100_to_200_a_idx on rangep_100_to_200 rangep_5 (actual rows=0 loops=1)
+ Index Searches: 1
Filter: (b = ANY (ARRAY[(InitPlan 1).col1, (InitPlan 2).col1]))
-(15 rows)
+(18 rows)
reset enable_sort;
drop table rangep;
diff --git a/src/test/regress/expected/select.out b/src/test/regress/expected/select.out
index 33a6dceb0..b7cf35b9a 100644
--- a/src/test/regress/expected/select.out
+++ b/src/test/regress/expected/select.out
@@ -763,8 +763,9 @@ select * from onek2 where unique2 = 11 and stringu1 = 'ATAAAA';
-----------------------------------------------------------------
Index Scan using onek2_u2_prtl on onek2 (actual rows=1 loops=1)
Index Cond: (unique2 = 11)
+ Index Searches: 1
Filter: (stringu1 = 'ATAAAA'::name)
-(3 rows)
+(4 rows)
explain (costs off)
select unique2 from onek2 where unique2 = 11 and stringu1 = 'ATAAAA';
diff --git a/src/test/regress/sql/memoize.sql b/src/test/regress/sql/memoize.sql
index 2eaeb1477..9afe205e0 100644
--- a/src/test/regress/sql/memoize.sql
+++ b/src/test/regress/sql/memoize.sql
@@ -23,8 +23,10 @@ begin
ln := regexp_replace(ln, 'Evictions: 0', 'Evictions: Zero');
ln := regexp_replace(ln, 'Evictions: \d+', 'Evictions: N');
ln := regexp_replace(ln, 'Memory Usage: \d+', 'Memory Usage: N');
- ln := regexp_replace(ln, 'Heap Fetches: \d+', 'Heap Fetches: N');
- ln := regexp_replace(ln, 'loops=\d+', 'loops=N');
+ ln := regexp_replace(ln, 'Heap Fetches: \d+', 'Heap Fetches: N');
+ ln := regexp_replace(ln, 'loops=\d+', 'loops=N');
+ ln := regexp_replace(ln, 'Index Searches: 0', 'Index Searches: Zero');
+ ln := regexp_replace(ln, 'Index Searches: \d+', 'Index Searches: N');
return next ln;
end loop;
end;
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index 442428d93..085e746af 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -573,6 +573,10 @@ begin
ln := regexp_replace(ln, 'Workers Launched: \d+', 'Workers Launched: N');
ln := regexp_replace(ln, 'actual rows=\d+ loops=\d+', 'actual rows=N loops=N');
ln := regexp_replace(ln, 'Rows Removed by Filter: \d+', 'Rows Removed by Filter: N');
+ perform regexp_matches(ln, 'Index Searches: \d+');
+ if found then
+ continue;
+ end if;
return next ln;
end loop;
end;
--
2.45.2
Attachment: v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch (application/octet-stream)
From 0e1c5d7e262d0930ef5f3d490de023a9ca00336f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 8 Aug 2024 15:41:18 -0400
Subject: [PATCH v6 3/4] Refactor handling of nbtree array redundancies.
Rather than allocating memory for so.keyData[] at the start of each
btrescan, lazily allocate space later on, in _bt_preprocess_keys. We
now allocate so.keyData[] after _bt_preprocess_array_keys is done
performing initial array-related preprocessing.
An immediate benefit of this approach is that _bt_preprocess_array_keys
no longer needs to explicitly mark redundant array scan keys. Other
code (_bt_preprocess_keys and its other subsidiary routines) no longer
has to interpret the scan key entries as redundant. Redundant array
scan keys simply never appear in the _bt_preprocess_keys input array
(_bt_preprocess_array_keys removes them up front).
This refactoring is also preparation for an upcoming patch that will add
skip scan optimizations to nbtree. _bt_preprocess_array_keys will be
taught to add new skip array scan keys to the _bt_preprocess_keys input
array (i.e. to arrayKeyData), so doing things this way avoids uselessly
palloc'ing so.keyData[], only to have to repalloc (to enlarge the array)
almost immediately afterwards. This scheme allows _bt_preprocess_keys
to output a so.keyData[] scan key array that can be larger than the
original scan.keyData[] input array, due to the addition of skip array
scan keys within _bt_preprocess_array_keys.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com
---
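Reviewer's note: the scheme described in the commit message can be
pictured with a small standalone model. This is only a sketch under
stated assumptions. ToyKey, ToyScan, and the toy_* functions are
hypothetical stand-ins (not nbtree types or routines), the "redundant"
field replaces the real redundancy tests, and malloc/free replace
palloc/pfree. It shows the two ideas together: redundant array keys are
dropped while copying (rather than being marked), and the output array
is allocated only once the final key count is known.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a scan key; not the real ScanKeyData */
typedef struct ToyKey
{
    int  attno;     /* index attribute number (1-based) */
    bool is_array;  /* stand-in for the SK_SEARCHARRAY flag */
    bool redundant; /* toy marker: duplicates an earlier array on attno */
} ToyKey;

/* Hypothetical stand-in for the scan's private state */
typedef struct ToyScan
{
    ToyKey *keyData;      /* output keys; NULL until preprocessing runs */
    int     numberOfKeys; /* number of valid entries in keyData */
} ToyScan;

/*
 * Copy each input key into the next free output slot, advancing the output
 * index only for keys that are kept.  Returns the number of keys kept.
 */
static int
toy_compact_keys(const ToyKey *input, int ninput, ToyKey *output)
{
    int output_ikey = 0;

    for (int input_ikey = 0; input_ikey < ninput; input_ikey++)
    {
        output[output_ikey] = input[input_ikey];
        if (output[output_ikey].is_array && output[output_ikey].redundant)
            continue;           /* slot is reused by the next input key */
        output_ikey++;          /* keep this key */
    }
    return output_ikey;
}

/* Start of a rescan: release whatever the previous rescan allocated */
static void
toy_rescan(ToyScan *scan)
{
    free(scan->keyData);        /* free(NULL) is a no-op */
    scan->keyData = NULL;
    scan->numberOfKeys = 0;
}

/* Allocate the output array lazily, sized from the compacted key count */
static void
toy_preprocess(ToyScan *scan, const ToyKey *input, int ninput)
{
    ToyKey *temp;
    int     nkept;

    if (scan->keyData != NULL || ninput <= 0)
        return;                 /* already done for this rescan, or no keys */

    temp = malloc(sizeof(ToyKey) * ninput);
    assert(temp != NULL);
    nkept = toy_compact_keys(input, ninput, temp);

    /* Always allocate at least one slot so the pointer is non-NULL */
    scan->keyData = malloc(sizeof(ToyKey) * (nkept > 0 ? nkept : 1));
    assert(scan->keyData != NULL);
    memcpy(scan->keyData, temp, sizeof(ToyKey) * nkept);
    scan->numberOfKeys = nkept;
    free(temp);
}
```

In this model a later change that adds keys during the compaction step
only has to grow the temporary array; the output allocation already
follows whatever count that step reports.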
src/backend/access/nbtree/nbtree.c | 10 +-
src/backend/access/nbtree/nbtutils.c | 156 +++++++++++++--------------
2 files changed, 82 insertions(+), 84 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e055571c8..689028dc5 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -325,11 +325,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
BTScanPosInvalidate(so->currPos);
BTScanPosInvalidate(so->markPos);
- if (scan->numberOfKeys > 0)
- so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
- else
- so->keyData = NULL;
+ so->keyData = NULL;
so->needPrimScan = false;
so->scanBehind = false;
so->oppoDirCheck = false;
@@ -411,6 +408,11 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
scan->numberOfKeys * sizeof(ScanKeyData));
so->numberOfKeys = 0; /* until _bt_preprocess_keys sets it */
so->numArrayKeys = 0; /* ditto */
+
+ /* Release private storage allocated in previous btrescan, if any */
+ if (so->keyData != NULL)
+ pfree(so->keyData);
+ so->keyData = NULL;
}
/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 98688a3d6..d1423bd85 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -62,7 +62,7 @@ static bool _bt_compare_array_scankey_args(IndexScanDesc scan,
ScanKey arraysk, ScanKey skey,
FmgrInfo *orderproc, BTArrayKeyInfo *array,
bool *qual_ok);
-static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan);
+static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys);
static void _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap);
static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
static inline int32 _bt_compare_array_skey(FmgrInfo *orderproc,
@@ -251,9 +251,6 @@ _bt_freestack(BTStack stack)
* It is convenient for _bt_preprocess_keys caller to have to deal with no
* more than one equality strategy array scan key per index attribute. We'll
* always be able to set things up that way when complete opfamilies are used.
- * Eliminated array scan keys can be recognized as those that have had their
- * sk_strategy field set to InvalidStrategy here by us. Caller should avoid
- * including these in the scan's so->keyData[] output array.
*
* We set the scan key references from the scan's BTArrayKeyInfo info array to
* offsets into the temp modified input array returned to caller. Scans that
@@ -261,18 +258,25 @@ _bt_freestack(BTStack stack)
* preprocessing steps are complete. This will convert the scan key offset
* references into references to the scan's so->keyData[] output scan keys.
*
+ * Caller must pass *numberOfKeys to give us a way to change the number of
+ * input scan keys (our output is caller's input). The returned array can be
+ * smaller than scan->keyData[] when we eliminated a redundant array scan key
+ * (redundant with some other array scan key, for the same attribute). Caller
+ * uses this to allocate so->keyData[] for the current btrescan.
+ *
* Note: the reason we need to return a temp scan key array, rather than just
* scribbling on scan->keyData, is that callers are permitted to call btrescan
* without supplying a new set of scankey data.
*/
static ScanKey
-_bt_preprocess_array_keys(IndexScanDesc scan)
+_bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
- int numberOfKeys = scan->numberOfKeys;
+ int numArrayKeyData = scan->numberOfKeys;
int16 *indoption = rel->rd_indoption;
- int numArrayKeys;
+ int numArrayKeys,
+ output_ikey = 0;
int origarrayatt = InvalidAttrNumber,
origarraykey = -1;
Oid origelemtype = InvalidOid;
@@ -280,11 +284,11 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
MemoryContext oldContext;
ScanKey arrayKeyData; /* modified copy of scan->keyData */
- Assert(numberOfKeys);
+ Assert(scan->numberOfKeys);
/* Quick check to see if there are any array keys */
numArrayKeys = 0;
- for (int i = 0; i < numberOfKeys; i++)
+ for (int i = 0; i < scan->numberOfKeys; i++)
{
cur = &scan->keyData[i];
if (cur->sk_flags & SK_SEARCHARRAY)
@@ -317,19 +321,18 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
oldContext = MemoryContextSwitchTo(so->arrayContext);
- /* Create modifiable copy of scan->keyData in the workspace context */
- arrayKeyData = (ScanKey) palloc(numberOfKeys * sizeof(ScanKeyData));
- memcpy(arrayKeyData, scan->keyData, numberOfKeys * sizeof(ScanKeyData));
+ /* Create output scan keys in the workspace context */
+ arrayKeyData = (ScanKey) palloc(numArrayKeyData * sizeof(ScanKeyData));
/* Allocate space for per-array data in the workspace context */
so->arrayKeys = (BTArrayKeyInfo *) palloc(numArrayKeys * sizeof(BTArrayKeyInfo));
/* Allocate space for ORDER procs used to help _bt_checkkeys */
- so->orderProcs = (FmgrInfo *) palloc(numberOfKeys * sizeof(FmgrInfo));
+ so->orderProcs = (FmgrInfo *) palloc(numArrayKeyData * sizeof(FmgrInfo));
/* Now process each array key */
numArrayKeys = 0;
- for (int i = 0; i < numberOfKeys; i++)
+ for (int input_ikey = 0; input_ikey < scan->numberOfKeys; input_ikey++)
{
FmgrInfo sortproc;
FmgrInfo *sortprocp = &sortproc;
@@ -345,14 +348,20 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
int num_nonnulls;
int j;
- cur = &arrayKeyData[i];
+ /*
+ * Copy input scan key into temp arrayKeyData scan key array
+ */
+ cur = &arrayKeyData[output_ikey];
+ *cur = scan->keyData[input_ikey];
+
if (!(cur->sk_flags & SK_SEARCHARRAY))
+ {
+ output_ikey++; /* keep this non-array scan key */
continue;
+ }
/*
- * First, deconstruct the array into elements. Anything allocated
- * here (including a possibly detoasted array value) is in the
- * workspace context.
+ * Deconstruct the array into elements
*/
arrayval = DatumGetArrayTypeP(cur->sk_argument);
/* We could cache this data, but not clear it's worth it */
@@ -406,6 +415,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
_bt_find_extreme_element(scan, cur, elemtype,
BTGreaterStrategyNumber,
elem_values, num_nonnulls);
+ output_ikey++; /* keep this transformed scan key */
continue;
case BTEqualStrategyNumber:
/* proceed with rest of loop */
@@ -416,6 +426,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
_bt_find_extreme_element(scan, cur, elemtype,
BTLessStrategyNumber,
elem_values, num_nonnulls);
+ output_ikey++; /* keep this transformed scan key */
continue;
default:
elog(ERROR, "unrecognized StrategyNumber: %d",
@@ -432,7 +443,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
* sortproc just points to the same proc used during binary searches.
*/
_bt_setup_array_cmp(scan, cur, elemtype,
- &so->orderProcs[i], &sortprocp);
+ &so->orderProcs[output_ikey], &sortprocp);
/*
* Sort the non-null elements and eliminate any duplicates. We must
@@ -476,11 +487,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
break;
}
- /*
- * Indicate to _bt_preprocess_keys caller that it must ignore
- * this scan key
- */
- cur->sk_strategy = InvalidStrategy;
+ /* Throw away this scan key/array */
continue;
}
@@ -511,12 +518,15 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
* Note: _bt_preprocess_array_keys_final will fix-up each array's
* scan_key field later on, after so->keyData[] has been finalized.
*/
- so->arrayKeys[numArrayKeys].scan_key = i;
+ so->arrayKeys[numArrayKeys].scan_key = output_ikey;
so->arrayKeys[numArrayKeys].num_elems = num_elems;
so->arrayKeys[numArrayKeys].elem_values = elem_values;
numArrayKeys++;
+ output_ikey++; /* keep this scan key/array */
}
+ /* Set final number of arrayKeyData[] keys, array keys */
+ *numberOfKeys = output_ikey;
so->numArrayKeys = numArrayKeys;
MemoryContextSwitchTo(oldContext);
@@ -2429,10 +2439,12 @@ end_toplevel_scan:
/*
* _bt_preprocess_keys() -- Preprocess scan keys
*
+ * The first call here (per btrescan) allocates so->keyData[].
* The given search-type keys (taken from scan->keyData[])
* are copied to so->keyData[] with possible transformation.
* scan->numberOfKeys is the number of input keys, so->numberOfKeys gets
- * the number of output keys (possibly less, never greater).
+ * the number of output keys. Calling here a second or subsequent time
+ * (during the same btrescan) is a no-op.
*
* The output keys are marked with additional sk_flags bits beyond the
* system-standard bits supplied by the caller. The DESC and NULLS_FIRST
@@ -2519,9 +2531,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
int16 *indoption = scan->indexRelation->rd_indoption;
int new_numberOfKeys;
int numberOfEqualCols;
- ScanKey inkeys;
- ScanKey outkeys;
- ScanKey cur;
+ ScanKey inputsk;
BTScanKeyPreproc xform[BTMaxStrategyNumber];
bool test_result;
int i,
@@ -2553,7 +2563,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
return; /* done if qual-less scan */
/* If any keys are SK_SEARCHARRAY type, set up array-key info */
- arrayKeyData = _bt_preprocess_array_keys(scan);
+ arrayKeyData = _bt_preprocess_array_keys(scan, &numberOfKeys);
if (!so->qual_ok)
{
/* unmatchable array, so give up */
@@ -2567,32 +2577,36 @@ _bt_preprocess_keys(IndexScanDesc scan)
*/
if (arrayKeyData)
{
- inkeys = arrayKeyData;
+ inputsk = arrayKeyData;
/* Also maintain keyDataMap for remapping so->orderProc[] later */
keyDataMap = MemoryContextAlloc(so->arrayContext,
numberOfKeys * sizeof(int));
}
else
- inkeys = scan->keyData;
+ inputsk = scan->keyData;
+
+ /*
+ * Now that we have an estimate of the number of output scan keys,
+ * allocate space for them
+ */
+ so->keyData = palloc(sizeof(ScanKeyData) * numberOfKeys);
- outkeys = so->keyData;
- cur = &inkeys[0];
/* we check that input keys are correctly ordered */
- if (cur->sk_attno < 1)
+ if (inputsk[0].sk_attno < 1)
elog(ERROR, "btree index keys must be ordered by attribute");
/* We can short-circuit most of the work if there's just one key */
if (numberOfKeys == 1)
{
/* Apply indoption to scankey (might change sk_strategy!) */
- if (!_bt_fix_scankey_strategy(cur, indoption))
+ if (!_bt_fix_scankey_strategy(inputsk, indoption))
so->qual_ok = false;
- memcpy(outkeys, cur, sizeof(ScanKeyData));
+ memcpy(&so->keyData[0], &inputsk[0], sizeof(ScanKeyData));
so->numberOfKeys = 1;
/* We can mark the qual as required if it's for first index col */
- if (cur->sk_attno == 1)
- _bt_mark_scankey_required(outkeys);
+ if (inputsk[0].sk_attno == 1)
+ _bt_mark_scankey_required(&so->keyData[0]);
if (arrayKeyData)
{
/*
@@ -2600,8 +2614,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
* (we'll miss out on the single value array transformation, but
* that's not nearly as important when there's only one scan key)
*/
- Assert(cur->sk_flags & SK_SEARCHARRAY);
- Assert(cur->sk_strategy != BTEqualStrategyNumber ||
+ Assert(so->keyData[0].sk_flags & SK_SEARCHARRAY);
+ Assert(so->keyData[0].sk_strategy != BTEqualStrategyNumber ||
(so->arrayKeys[0].scan_key == 0 &&
OidIsValid(so->orderProcs[0].fn_oid)));
}
@@ -2629,12 +2643,12 @@ _bt_preprocess_keys(IndexScanDesc scan)
* handle after-last-key processing. Actual exit from the loop is at the
* "break" statement below.
*/
- for (i = 0;; cur++, i++)
+ for (i = 0;; inputsk++, i++)
{
if (i < numberOfKeys)
{
/* Apply indoption to scankey (might change sk_strategy!) */
- if (!_bt_fix_scankey_strategy(cur, indoption))
+ if (!_bt_fix_scankey_strategy(inputsk, indoption))
{
/* NULL can't be matched, so give up */
so->qual_ok = false;
@@ -2646,12 +2660,12 @@ _bt_preprocess_keys(IndexScanDesc scan)
* If we are at the end of the keys for a particular attr, finish up
* processing and emit the cleaned-up keys.
*/
- if (i == numberOfKeys || cur->sk_attno != attno)
+ if (i == numberOfKeys || inputsk->sk_attno != attno)
{
int priorNumberOfEqualCols = numberOfEqualCols;
/* check input keys are correctly ordered */
- if (i < numberOfKeys && cur->sk_attno < attno)
+ if (i < numberOfKeys && inputsk->sk_attno < attno)
elog(ERROR, "btree index keys must be ordered by attribute");
/*
@@ -2755,7 +2769,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
}
/*
- * Emit the cleaned-up keys into the outkeys[] array, and then
+ * Emit the cleaned-up keys into the so->keyData[] array, and then
* mark them if they are required. They are required (possibly
* only in one direction) if all attrs before this one had "=".
*/
@@ -2763,7 +2777,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
{
if (xform[j].skey)
{
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
memcpy(outkey, xform[j].skey, sizeof(ScanKeyData));
if (arrayKeyData)
@@ -2780,19 +2794,19 @@ _bt_preprocess_keys(IndexScanDesc scan)
break;
/* Re-initialize for new attno */
- attno = cur->sk_attno;
+ attno = inputsk->sk_attno;
memset(xform, 0, sizeof(xform));
}
/* check strategy this key's operator corresponds to */
- j = cur->sk_strategy - 1;
+ j = inputsk->sk_strategy - 1;
/* if row comparison, push it directly to the output array */
- if (cur->sk_flags & SK_ROW_HEADER)
+ if (inputsk->sk_flags & SK_ROW_HEADER)
{
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
- memcpy(outkey, cur, sizeof(ScanKeyData));
+ memcpy(outkey, inputsk, sizeof(ScanKeyData));
if (arrayKeyData)
keyDataMap[new_numberOfKeys - 1] = i;
if (numberOfEqualCols == attno - 1)
@@ -2806,21 +2820,10 @@ _bt_preprocess_keys(IndexScanDesc scan)
continue;
}
- /*
- * Does this input scan key require further processing as an array?
- */
- if (cur->sk_strategy == InvalidStrategy)
+ if (inputsk->sk_strategy == BTEqualStrategyNumber &&
+ (inputsk->sk_flags & SK_SEARCHARRAY))
{
- /* _bt_preprocess_array_keys marked this array key redundant */
- Assert(arrayKeyData);
- Assert(cur->sk_flags & SK_SEARCHARRAY);
- continue;
- }
-
- if (cur->sk_strategy == BTEqualStrategyNumber &&
- (cur->sk_flags & SK_SEARCHARRAY))
- {
- /* _bt_preprocess_array_keys kept this array key */
+ /* maintain arrayidx for xform[] array */
Assert(arrayKeyData);
arrayidx++;
}
@@ -2832,7 +2835,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (xform[j].skey == NULL)
{
/* nope, so this scan key wins by default (at least for now) */
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -2850,7 +2853,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
/*
* Have to set up array keys
*/
- if ((cur->sk_flags & SK_SEARCHARRAY))
+ if (inputsk->sk_flags & SK_SEARCHARRAY)
{
array = &so->arrayKeys[arrayidx - 1];
orderproc = so->orderProcs + i;
@@ -2878,7 +2881,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
*/
}
- if (_bt_compare_scankey_args(scan, cur, cur, xform[j].skey,
+ if (_bt_compare_scankey_args(scan, inputsk, inputsk, xform[j].skey,
array, orderproc, &test_result))
{
/* Have all we need to determine redundancy */
@@ -2892,7 +2895,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (j != (BTEqualStrategyNumber - 1) ||
!(xform[j].skey->sk_flags & SK_SEARCHARRAY))
{
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -2905,7 +2908,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
* scan key. _bt_compare_scankey_args expects us to
* always keep arrays (and discard non-arrays).
*/
- Assert(!(cur->sk_flags & SK_SEARCHARRAY));
+ Assert(!(inputsk->sk_flags & SK_SEARCHARRAY));
}
}
else if (j == (BTEqualStrategyNumber - 1))
@@ -2928,14 +2931,14 @@ _bt_preprocess_keys(IndexScanDesc scan)
* even with incomplete opfamilies. _bt_advance_array_keys
* depends on this.
*/
- ScanKey outkey = &outkeys[new_numberOfKeys++];
+ ScanKey outkey = &so->keyData[new_numberOfKeys++];
memcpy(outkey, xform[j].skey, sizeof(ScanKeyData));
if (arrayKeyData)
keyDataMap[new_numberOfKeys - 1] = xform[j].ikey;
if (numberOfEqualCols == attno - 1)
_bt_mark_scankey_required(outkey);
- xform[j].skey = cur;
+ xform[j].skey = inputsk;
xform[j].ikey = i;
xform[j].arrayidx = arrayidx;
}
@@ -3349,13 +3352,6 @@ _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption)
return true;
}
- if (skey->sk_strategy == InvalidStrategy)
- {
- /* Already-eliminated array scan key; don't need to fix anything */
- Assert(skey->sk_flags & SK_SEARCHARRAY);
- return true;
- }
-
/* Adjust strategy for DESC, if we didn't already */
if ((addflags & SK_BT_DESC) && !(skey->sk_flags & SK_BT_DESC))
skey->sk_strategy = BTCommuteStrategyNumber(skey->sk_strategy);
--
2.45.2
Attachment: v6-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch (application/octet-stream)
From 44e64c24e6ba073a2f97cef15ae49281c2046e86 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 8 Aug 2024 13:51:18 -0400
Subject: [PATCH v6 2/4] Normalize nbtree truncated high key array behavior.
Commit 5bf748b8 taught nbtree ScalarArrayOp array processing to decide
when and how to start the next primitive index scan based on physical
index characteristics. This included rules for deciding whether to
start a new primitive index scan (or whether to move onto the right
sibling leaf page instead) whenever the scan encounters a leaf high key
with truncated lower-order columns whose omitted/-inf values are covered
by one or more arrays.
Prior to this commit, nbtree would treat a truncated column as
satisfying a scan key that is marked required in the current scan
direction. It would just give up and start a new primitive index scan
in cases involving inequalities required in the opposite direction only
(in practice this meant > and >= strategy scan keys, since only forward
scans consider the page high key like this).
Bring > and >= strategy scan keys in line with other required scan key
types: have nbtree persist with its current primitive index scan
regardless of the operator strategy in use. This requires scheduling
and then performing an explicit check of the next page's high key (if
any) at the point that _bt_readpage is next called.
Although this could be considered a standalone piece of work, it's
mostly intended as preparation for an upcoming patch that adds skip scan
optimizations to nbtree. Without this work there are cases where the
scan's skip arrays trigger an excessive number of primitive index scans
due to most high keys having a truncated attribute that was previously
treated as not satisfying a required > or >= strategy scan key.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com
---
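Reviewer's note: the "schedule now, check on the next page" idea from
the commit message can be pictured with a toy model. This is a sketch
under stated assumptions, not the patch's logic. ToyScanState,
ToyHighKey, and the toy_* functions are hypothetical, the index has a
single integer column, and the only required scan key is
"col > lower_bound" in a forward scan. The exact condition that the real
_bt_oppodir_checkkeys tests is not shown here; the model simply backs
out when the next page's high key proves that the whole page precedes
the lower bound.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical scan state for a forward scan with one "col > bound" key */
typedef struct ToyScanState
{
    int  lower_bound;    /* toy required key: col > lower_bound */
    bool oppo_dir_check; /* was a check scheduled for the next page? */
    bool need_prim_scan; /* give up and start a new primitive scan? */
} ToyScanState;

/* Hypothetical high key for a single-column index */
typedef struct ToyHighKey
{
    bool truncated; /* attribute removed by suffix truncation (acts as -inf) */
    int  value;     /* attribute value; meaningful only when !truncated */
} ToyHighKey;

/*
 * End of the current page.  Returns true when the scan should step to the
 * right sibling page.  A truncated high key can't be compared against the
 * bound, so keep going but schedule a check for the next page.
 */
static bool
toy_end_of_page(ToyScanState *so, ToyHighKey hikey)
{
    if (hikey.truncated)
        so->oppo_dir_check = true;
    else if (hikey.value <= so->lower_bound)
    {
        so->need_prim_scan = true;
        return false;
    }
    return true;
}

/*
 * Start of the next page.  Returns true when the page should be read.
 * Performs the scheduled check (if any) against this page's high key.
 */
static bool
toy_start_of_page(ToyScanState *so, ToyHighKey finaltup)
{
    if (so->oppo_dir_check)
    {
        so->oppo_dir_check = false;
        if (!finaltup.truncated && finaltup.value <= so->lower_bound)
        {
            /* Every tuple on this page is <= the bound: back out */
            so->need_prim_scan = true;
            return false;
        }
    }
    return true;
}
```

The point of the deferral in this model is that a single truncated high
key no longer forces a new primitive scan on its own; the scan pays for
one extra page visit and decides with better information.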
src/include/access/nbtree.h | 3 +
src/backend/access/nbtree/nbtree.c | 4 +
src/backend/access/nbtree/nbtsearch.c | 22 +++++
src/backend/access/nbtree/nbtutils.c | 119 ++++++++++++++------------
4 files changed, 95 insertions(+), 53 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9af9b3ecd..f578cdb73 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1048,6 +1048,7 @@ typedef struct BTScanOpaqueData
int numArrayKeys; /* number of equality-type array keys */
bool needPrimScan; /* New prim scan to continue in current dir? */
bool scanBehind; /* Last array advancement matched -inf attr? */
+ bool oppoDirCheck; /* check opposite dir scan keys? */
BTArrayKeyInfo *arrayKeys; /* info about each equality-type array key */
FmgrInfo *orderProcs; /* ORDER procs for required equality keys */
MemoryContext arrayContext; /* scan-lifespan context for array data */
@@ -1288,6 +1289,8 @@ extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_preprocess_keys(IndexScanDesc scan);
extern bool _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, bool arrayKeys,
IndexTuple tuple, int tupnatts);
+extern bool _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
+ IndexTuple finaltup);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index dfef6c12d..e055571c8 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -332,6 +332,7 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->needPrimScan = false;
so->scanBehind = false;
+ so->oppoDirCheck = false;
so->arrayKeys = NULL;
so->orderProcs = NULL;
so->arrayContext = NULL;
@@ -375,6 +376,7 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
so->markItemIndex = -1;
so->needPrimScan = false;
so->scanBehind = false;
+ so->oppoDirCheck = false;
BTScanPosUnpinIfPinned(so->markPos);
BTScanPosInvalidate(so->markPos);
@@ -629,6 +631,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
*/
so->needPrimScan = false;
so->scanBehind = false;
+ so->oppoDirCheck = false;
}
else
{
@@ -673,6 +676,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
}
so->needPrimScan = true;
so->scanBehind = false;
+ so->oppoDirCheck = false;
*pageno = InvalidBlockNumber;
exit_loop = true;
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 4b91a192e..e5f941e0a 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1704,6 +1704,28 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
ItemId iid = PageGetItemId(page, P_HIKEY);
pstate.finaltup = (IndexTuple) PageGetItem(page, iid);
+
+ if (unlikely(so->oppoDirCheck))
+ {
+ /*
+ * Last _bt_readpage call scheduled precheck of finaltup for
+ * required scan keys up to and including a > or >= scan key
+ * (necessary because > and >= are only generally considered
+ * required when scanning backwards)
+ */
+ Assert(so->scanBehind);
+ so->oppoDirCheck = false;
+ if (!_bt_oppodir_checkkeys(scan, dir, pstate.finaltup))
+ {
+ /*
+ * Back out of continuing with this leaf page -- schedule
+ * another primitive index scan after all
+ */
+ so->currPos.moreRight = false;
+ so->needPrimScan = true;
+ return false;
+ }
+ }
}
/* load items[] in ascending order */
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c22ccec78..98688a3d6 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1362,7 +1362,7 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
curArrayKey->cur_elem = 0;
skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
}
- so->scanBehind = false;
+ so->scanBehind = so->oppoDirCheck = false; /* reset */
}
/*
@@ -1671,8 +1671,7 @@ _bt_start_prim_scan(IndexScanDesc scan, ScanDirection dir)
Assert(so->numArrayKeys);
- /* scanBehind flag doesn't persist across primitive index scans - reset */
- so->scanBehind = false;
+ so->scanBehind = so->oppoDirCheck = false; /* reset */
/*
* Array keys are advanced within _bt_checkkeys when the scan reaches the
@@ -1808,7 +1807,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
Assert(!_bt_tuple_before_array_skeys(scan, dir, tuple, tupdesc,
tupnatts, false, 0, NULL));
- so->scanBehind = false; /* reset */
+ so->scanBehind = so->oppoDirCheck = false; /* reset */
/*
* Required scan key wasn't satisfied, so required arrays will have to
@@ -2293,19 +2292,27 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
if (so->scanBehind && has_required_opposite_direction_only)
{
/*
- * However, we avoid this behavior whenever the scan involves a scan
+ * However, we do things differently whenever the scan involves a scan
* key required in the opposite direction to the scan only, along with
* a finaltup with at least one truncated attribute that's associated
* with a scan key marked required (required in either direction).
*
* _bt_check_compare simply won't stop the scan for a scan key that's
* marked required in the opposite scan direction only. That leaves
- * us without any reliable way of reconsidering any opposite-direction
+ * us without an automatic way of reconsidering any opposite-direction
* inequalities if it turns out that starting a new primitive index
* scan will allow _bt_first to skip ahead by a great many leaf pages
* (see next section for details of how that works).
+ *
+ * We deal with this by explicitly scheduling a finaltup recheck for
+ * the next page -- we'll call _bt_oppodir_checkkeys for the next
+ * page's finaltup. You can think of this as a way of dealing
+ * with this page's finaltup being truncated by checking the next
+ * page's finaltup instead. And you can think of the oppoDirCheck
+ * recheck handling within _bt_readpage as complementing the similar
+ * scanBehind recheck made from within _bt_checkkeys.
*/
- goto new_prim_scan;
+ so->oppoDirCheck = true; /* schedule next page's finaltup recheck */
}
/*
@@ -2343,54 +2350,16 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
* also before the _bt_first-wise start of tuples for our new qual. That
* at least suggests many more skippable pages beyond the current page.
*/
- if (has_required_opposite_direction_only && pstate->finaltup &&
- (all_required_satisfied || oppodir_inequality_sktrig))
+ else if (has_required_opposite_direction_only && pstate->finaltup &&
+ (all_required_satisfied || oppodir_inequality_sktrig) &&
+ unlikely(!_bt_oppodir_checkkeys(scan, dir, pstate->finaltup)))
{
- int nfinaltupatts = BTreeTupleGetNAtts(pstate->finaltup, rel);
- ScanDirection flipped;
- bool continuescanflip;
- int opsktrig;
-
/*
- * We're checking finaltup (which is usually not caller's tuple), so
- * cannot reuse work from caller's earlier _bt_check_compare call.
- *
- * Flip the scan direction when calling _bt_check_compare this time,
- * so that it will set continuescanflip=false when it encounters an
- * inequality required in the opposite scan direction.
+ * Make sure that any non-required arrays are set to the first array
+ * element for the current scan direction
*/
- Assert(!so->scanBehind);
- opsktrig = 0;
- flipped = -dir;
- _bt_check_compare(scan, flipped,
- pstate->finaltup, nfinaltupatts, tupdesc,
- false, false, false,
- &continuescanflip, &opsktrig);
-
- /*
- * Only start a new primitive index scan when finaltup has a required
- * unsatisfied inequality (unsatisfied in the opposite direction)
- */
- Assert(all_required_satisfied != oppodir_inequality_sktrig);
- if (unlikely(!continuescanflip &&
- so->keyData[opsktrig].sk_strategy != BTEqualStrategyNumber))
- {
- /*
- * It's possible for the same inequality to be unsatisfied by both
- * caller's tuple (in scan's direction) and finaltup (in the
- * opposite direction) due to _bt_check_compare's behavior with
- * NULLs
- */
- Assert(opsktrig >= sktrig); /* not opsktrig > sktrig due to NULLs */
-
- /*
- * Make sure that any non-required arrays are set to the first
- * array element for the current scan direction
- */
- _bt_rewind_nonrequired_arrays(scan, dir);
-
- goto new_prim_scan;
- }
+ _bt_rewind_nonrequired_arrays(scan, dir);
+ goto new_prim_scan;
}
/*
@@ -3522,7 +3491,8 @@ _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, bool arrayKeys,
*
* Assert that the scan isn't in danger of becoming confused.
*/
- Assert(!so->scanBehind && !pstate->prechecked && !pstate->firstmatch);
+ Assert(!so->scanBehind && !so->oppoDirCheck);
+ Assert(!pstate->prechecked && !pstate->firstmatch);
Assert(!_bt_tuple_before_array_skeys(scan, dir, tuple, tupdesc,
tupnatts, false, 0, NULL));
}
@@ -3634,6 +3604,49 @@ _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, bool arrayKeys,
ikey, true);
}
+/*
+ * Test whether an indextuple satisfies inequalities required in the opposite
+ * direction only (and lower-order equalities required in either direction).
+ *
+ * scan: index scan descriptor (containing a search-type scankey)
+ * dir: current scan direction (flipped by us to get opposite direction)
+ * finaltup: final index tuple on the page
+ *
+ * Caller's finaltup tuple is the page high key (for forwards scans), or the
+ * first non-pivot tuple (for backwards scans). Called during scans with
+ * required array keys.
+ *
+ * Return true if finaltup satisfies keys, false if not. If the tuple fails to
+ * pass the qual, then caller should start another primitive index scan;
+ * _bt_first can efficiently relocate the scan to a far later leaf page.
+ *
+ * Note: we focus on required-in-opposite-direction scan keys (e.g. for a
+ * required > or >= key, assuming a forwards scan) because _bt_checkkeys() can
+ * always deal with required-in-current-direction scan keys on its own.
+ */
+bool
+_bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
+ IndexTuple finaltup)
+{
+ Relation rel = scan->indexRelation;
+ TupleDesc tupdesc = RelationGetDescr(rel);
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ int nfinaltupatts = BTreeTupleGetNAtts(finaltup, rel);
+ bool continuescan;
+ ScanDirection flipped = -dir;
+ int ikey = 0;
+
+ Assert(so->numArrayKeys);
+
+ _bt_check_compare(scan, flipped, finaltup, nfinaltupatts, tupdesc,
+ false, false, false, &continuescan, &ikey);
+
+ if (!continuescan && so->keyData[ikey].sk_strategy != BTEqualStrategyNumber)
+ return false;
+
+ return true;
+}
+
/*
* Test whether an indextuple satisfies current scan condition.
*
--
2.45.2
Attachment: v6-0004-Add-skip-scan-to-nbtree.patch (application/octet-stream)
From 7d4f52c26f91da587d20b84321501065025ddd5a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 16 Apr 2024 13:21:36 -0400
Subject: [PATCH v6 4/4] Add skip scan to nbtree.
Skip scan allows nbtree index scans to efficiently use a composite index
on the columns (a, b) for queries with a qual "WHERE b = 5". This is
useful in cases where the total number of distinct values in the column
'a' is reasonably small (think hundreds, possibly thousands). In
effect, a skip scan treats the composite index on (a, b) as if it was a
series of disjunct subindexes -- one subindex per distinct 'a' value.
The scan exhaustively "searches every subindex" by using a qual that
behaves like "WHERE a = ANY(<every possible 'a' value>) AND b = 5".
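To make the "one subindex per distinct 'a' value" idea concrete, here is a minimal standalone sketch. It is not code from the patch: a sorted in-memory array stands in for the index on (a, b), a binary search stands in for one index descent, and the names Pair, sketch_descend, and sketch_skip_scan are hypothetical. The sketch repositions itself once per candidate 'a' value instead of reading every tuple.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for one tuple of a composite index on (a, b) */
typedef struct Pair
{
	int			a;
	int			b;
} Pair;

/* Models one index descent: first position whose tuple is >= (a, b) */
static size_t
sketch_descend(const Pair *idx, size_t n, int a, int b)
{
	size_t		lo = 0;
	size_t		hi = n;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (idx[mid].a < a || (idx[mid].a == a && idx[mid].b < b))
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Count tuples with b == target_b in an array sorted on (a, b), descending
 * once per candidate 'a' value.  *ndescents reports how many descents ran.
 */
static size_t
sketch_skip_scan(const Pair *idx, size_t n, int target_b, size_t *ndescents)
{
	size_t		matches = 0;
	int			cur_a;

	assert(ndescents != NULL);
	*ndescents = 0;
	if (n == 0)
		return 0;

	cur_a = idx[0].a;			/* lowest 'a' value present */
	for (;;)
	{
		size_t		pos = sketch_descend(idx, n, cur_a, target_b);

		(*ndescents)++;
		while (pos < n && idx[pos].a == cur_a && idx[pos].b == target_b)
		{
			matches++;
			pos++;
		}
		if (pos == n)
			break;				/* ran off the end of the "index" */
		if (idx[pos].a > cur_a)
			cur_a = idx[pos].a; /* jump to the next 'a' actually present */
		else if (cur_a == INT_MAX)
			break;				/* no larger 'a' value can exist */
		else
			cur_a++;			/* optimistically try the very next value */
	}
	return matches;
}
```

Calling sketch_skip_scan(idx, n, 5, &ndescents) on a sorted array mirrors a "WHERE b = 5" qual. The number of descents grows with the number of distinct 'a' values rather than with the total number of tuples, which is why a low-cardinality leading column matters.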
This works by extending the design for arrays established by commit
5bf748b8. "Skip arrays" generate their array values procedurally and
on-demand, but otherwise work just like conventional SAOP arrays.
The core B-Tree operator classes on most discrete types generate their
array elements with help from their own custom skip support routine.
This infrastructure gives nbtree a way to generate the next required
array element by incrementing (or decrementing) the current array value.
It can reduce the number of index descents in cases where the very next
indexable value frequently turns out to be the next indexed value. In
practice, this is likely whenever the scan skips with an input opclass
where "dense" indexed values occur naturally, such as btree/date_ops.
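As a rough illustration of what "incrementing (or decrementing) the current array value" involves, the following standalone sketch uses a plain int32_t. It deliberately does not use the patch's real callback signature (which takes a Relation and a Datum); the function names are hypothetical, and always writing *overflow on every call is an assumption of this sketch rather than a documented requirement.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical increment helper: return the next value after "existing".
 * When "existing" is already the highest possible value, report overflow
 * (the returned value is then meaningless to the caller).
 */
static int32_t
sketch_int4_increment(int32_t existing, bool *overflow)
{
	if (existing == INT32_MAX)
	{
		*overflow = true;
		return existing;
	}
	*overflow = false;
	return existing + 1;
}

/* Hypothetical decrement helper, the mirror image for backwards scans */
static int32_t
sketch_int4_decrement(int32_t existing, bool *overflow)
{
	if (existing == INT32_MIN)
	{
		*overflow = true;
		return existing;
	}
	*overflow = false;
	return existing - 1;
}
```

With dense values such as consecutive dates or identity column values, the incremented value is usually present in the index, so the next descent lands directly on matching tuples.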
Opclasses that lack a skip support routine fall back on having nbtree
"increment" (or "decrement") a skip array's current element by setting
the NEXTPRIOR scan key flag, without directly changing its sk_argument.
The presence of NEXTPRIOR makes the scan treat the key's value as coming
immediately after (or immediately before) its stored sk_argument in the
key space. The key value must still come before (or still come
after) any possible greater-than (or less-than) indexable/non-sentinel
value. Obviously, the scan will never locate any exactly equal tuples.
But attempting to locate a match serves to make the scan locate the true
next value in whatever way it determines is most efficient, without any
need for special cases in high level scan-related code. In particular,
this design obviates the need for explicit "next key" index probes.
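The effect of such a "next key" sentinel can be sketched with a toy comparator, shown here for the forward scan case only. This is an assumption-laden illustration, not the patch's implementation: SketchKey, its "next" flag (standing in for NEXTPRIOR), and both functions are hypothetical. The key never compares equal to a stored value, yet an ordinary binary search with it lands on the true next value.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical scan key: "next" models NEXTPRIOR during a forward scan */
typedef struct SketchKey
{
	int			value;
	bool		next;			/* key sorts just after "value" */
} SketchKey;

/* Returns <0, 0, or >0 as the key sorts before, equal to, or after tupval */
static int
sketch_key_cmp(const SketchKey *key, int tupval)
{
	if (key->value != tupval)
		return (key->value < tupval) ? -1 : 1;
	/* Same stored value: a "next" key still sorts after every equal tuple */
	return key->next ? 1 : 0;
}

/* First position in sorted vals[] that the key does not sort after */
static size_t
sketch_first_ge(const int *vals, size_t n, const SketchKey *key)
{
	size_t		lo = 0;
	size_t		hi = n;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (sketch_key_cmp(key, vals[mid]) > 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}
```

Searching with a key of {1, true} skips past every stored 1 and stops at the first larger value, whatever that value happens to be. No separate probe is needed to discover the next key first.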
Though it's typical for nbtree preprocessing to cons up skip arrays when
it will allow the scan to apply one or more omitted-from-query leading
key columns when skipping, that's never a requirement. There are hardly
any limitations around where skip arrays/scan keys may appear relative
to conventional/input scan keys. This is no less true in the presence
of conventional SAOP array scan keys, which may both roll over and be
rolled over by skip arrays. For example, a skip array on the column "b"
is generated with quals such as "WHERE a = 42 AND c IN (1, 2, 3)". As
with any nbtree scan involving arrays, whether or not we actually skip
depends on the physical characteristics of the index during the scan.
The optimizer doesn't use distinct new index paths to represent index
skip scans. Skipping isn't an either/or question. It's possible for
individual index scans to conspicuously vary how and when they skip in
order to deal with variation in how leading column values cluster
together over the key space of the index. A dynamic strategy seems to
work best. Skipping can be used during nbtree bitmap index scans,
nbtree index scans, and nbtree index-only scans. Parallel index skip
scan is also supported.
Preprocessing will never cons up a skip array for an index column that
has an equality strategy scan key on input, but will do so for indexed
columns that are only constrained by inequality type input scan keys.
This allows the scan to skip using a range skip array. These are just
like conventional skip arrays, but only generate values from within a
given range. The range is constrained by input inequality scan keys.
For example, a skip array on "a" can only ever use array element values
1 and 2 when generated for a qual "WHERE a BETWEEN 1 AND 2 AND b = 66".
Such a skip array works very much like the conventional SAOP array that
would be generated given the qual "WHERE a = ANY('{1, 2}') AND b = 66".
Such transformations only happen when they enable later preprocessing to
mark the copied-from-input scan key on "b" required to continue the scan
(otherwise, preprocessing directly outputs the >= and <= keys on "a" in
the traditional way, without adding a superseding skip array on "a").
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com>
Reviewed-By: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
---
src/include/access/amapi.h | 3 +-
src/include/access/nbtree.h | 27 +-
src/include/catalog/pg_amproc.dat | 16 +
src/include/catalog/pg_proc.dat | 24 +
src/include/storage/lwlock.h | 1 +
src/include/utils/skipsupport.h | 109 ++
src/backend/access/index/indexam.c | 3 +-
src/backend/access/nbtree/nbtcompare.c | 261 +++
src/backend/access/nbtree/nbtree.c | 218 ++-
src/backend/access/nbtree/nbtsearch.c | 93 +-
src/backend/access/nbtree/nbtutils.c | 1418 +++++++++++++++--
src/backend/access/nbtree/nbtvalidate.c | 4 +
src/backend/commands/opclasscmds.c | 25 +
src/backend/storage/lmgr/lwlock.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/adt/Makefile | 1 +
src/backend/utils/adt/date.c | 44 +
src/backend/utils/adt/meson.build | 1 +
src/backend/utils/adt/selfuncs.c | 368 +++--
src/backend/utils/adt/skipsupport.c | 60 +
src/backend/utils/adt/uuid.c | 67 +
src/backend/utils/misc/guc_tables.c | 23 +
doc/src/sgml/btree.sgml | 13 +
doc/src/sgml/indexam.sgml | 3 +-
doc/src/sgml/indices.sgml | 40 +-
doc/src/sgml/xindex.sgml | 16 +-
src/test/regress/expected/alter_generic.out | 6 +-
src/test/regress/expected/create_index.out | 4 +-
src/test/regress/expected/join.out | 61 +-
src/test/regress/expected/psql.out | 3 +-
src/test/regress/expected/union.out | 15 +-
src/test/regress/sql/alter_generic.sql | 2 +-
src/test/regress/sql/create_index.sql | 4 +-
src/tools/pgindent/typedefs.list | 3 +
34 files changed, 2630 insertions(+), 308 deletions(-)
create mode 100644 src/include/utils/skipsupport.h
create mode 100644 src/backend/utils/adt/skipsupport.c
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index f25c9d58a..651843b4e 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -195,7 +195,8 @@ typedef void (*amrestrpos_function) (IndexScanDesc scan);
*/
/* estimate size of parallel scan descriptor */
-typedef Size (*amestimateparallelscan_function) (int nkeys, int norderbys);
+typedef Size (*amestimateparallelscan_function) (Relation indexRelation,
+ int nkeys, int norderbys);
/* prepare for parallel index scan */
typedef void (*aminitparallelscan_function) (void *target);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f578cdb73..7271c7033 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -24,6 +24,7 @@
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/shm_toc.h"
+#include "utils/skipsupport.h"
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
typedef uint16 BTCycleId;
@@ -709,7 +710,8 @@ BTreeTupleGetMaxHeapTID(IndexTuple itup)
#define BTINRANGE_PROC 3
#define BTEQUALIMAGE_PROC 4
#define BTOPTIONS_PROC 5
-#define BTNProcs 5
+#define BTSKIPSUPPORT_PROC 6
+#define BTNProcs 6
/*
* We need to be able to tell the difference between read and write
@@ -1031,10 +1033,22 @@ typedef BTScanPosData *BTScanPos;
/* We need one of these for each equality-type SK_SEARCHARRAY scan key */
typedef struct BTArrayKeyInfo
{
+ /* fields used by both kinds of array (standard arrays and skip arrays) */
int scan_key; /* index of associated key in keyData */
+ int num_elems; /* number of elems (-1 for skip array) */
+
+ /* fields for standard arrays that store elements in memory */
int cur_elem; /* index of current element in elem_values */
- int num_elems; /* number of elems in current array value */
Datum *elem_values; /* array of num_elems Datums */
+
+ /* fields for skip arrays, which generate their elements procedurally */
+ bool use_sksup; /* sksup set to valid routine? */
+ bool null_elem; /* lowest/highest element actually NULL? */
+ SkipSupportData sksup; /* opclass skip scan support, when use_sksup */
+ ScanKey low_compare; /* array's > or >= lower bound */
+ ScanKey high_compare; /* array's < or <= upper bound */
+ FmgrInfo order_low; /* low_compare's ORDER proc */
+ FmgrInfo order_high; /* high_compare's ORDER proc */
} BTArrayKeyInfo;
typedef struct BTScanOpaqueData
@@ -1124,6 +1138,9 @@ typedef struct BTReadPageState
*/
#define SK_BT_REQFWD 0x00010000 /* required to continue forward scan */
#define SK_BT_REQBKWD 0x00020000 /* required to continue backward scan */
+#define SK_BT_SKIP 0x00040000 /* skip array, for skip scan */
+#define SK_BT_NEGPOSINF 0x00080000 /* no sk_argument, -inf/+inf key */
+#define SK_BT_NEXTPRIOR 0x00100000 /* sk_argument is next/prior key */
#define SK_BT_INDOPTION_SHIFT 24 /* must clear the above bits */
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
@@ -1160,6 +1177,10 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+/* GUC parameters (just a temporary convenience for reviewers) */
+extern PGDLLIMPORT int skipscan_prefix_cols;
+extern PGDLLIMPORT bool skipscan_skipsupport_enabled;
+
/*
* external entry points for btree, in nbtree.c
*/
@@ -1170,7 +1191,7 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
bool indexUnchanged,
struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
-extern Size btestimateparallelscan(int nkeys, int norderbys);
+extern Size btestimateparallelscan(Relation rel, int nkeys, int norderbys);
extern void btinitparallelscan(void *target);
extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index f639c3a6a..2a8f6f3f1 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -21,6 +21,8 @@
amprocrighttype => 'bit', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
amprocrighttype => 'bool', amprocnum => '1', amproc => 'btboolcmp' },
+{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
+ amprocrighttype => 'bool', amprocnum => '6', amproc => 'btboolskipsupport' },
{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
amprocrighttype => 'bool', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/bpchar_ops', amproclefttype => 'bpchar',
@@ -41,12 +43,16 @@
amprocrighttype => 'char', amprocnum => '1', amproc => 'btcharcmp' },
{ amprocfamily => 'btree/char_ops', amproclefttype => 'char',
amprocrighttype => 'char', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/char_ops', amproclefttype => 'char',
+ amprocrighttype => 'char', amprocnum => '6', amproc => 'btcharskipsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '1', amproc => 'date_cmp' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '2', amproc => 'date_sortsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
+ amprocrighttype => 'date', amprocnum => '6', amproc => 'date_skipsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'timestamp', amprocnum => '1',
amproc => 'date_cmp_timestamp' },
@@ -122,6 +128,8 @@
amprocrighttype => 'int2', amprocnum => '2', amproc => 'btint2sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
+ amprocrighttype => 'int2', amprocnum => '6', amproc => 'btint2skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint24cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
@@ -141,6 +149,8 @@
amprocrighttype => 'int4', amprocnum => '2', amproc => 'btint4sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
+ amprocrighttype => 'int4', amprocnum => '6', amproc => 'btint4skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
amprocrighttype => 'int8', amprocnum => '1', amproc => 'btint48cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
@@ -160,6 +170,8 @@
amprocrighttype => 'int8', amprocnum => '2', amproc => 'btint8sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
amprocrighttype => 'int8', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
+ amprocrighttype => 'int8', amprocnum => '6', amproc => 'btint8skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint84cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
@@ -193,6 +205,8 @@
amprocrighttype => 'oid', amprocnum => '2', amproc => 'btoidsortsupport' },
{ amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
amprocrighttype => 'oid', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
+ amprocrighttype => 'oid', amprocnum => '6', amproc => 'btoidskipsupport' },
{ amprocfamily => 'btree/oidvector_ops', amproclefttype => 'oidvector',
amprocrighttype => 'oidvector', amprocnum => '1',
amproc => 'btoidvectorcmp' },
@@ -261,6 +275,8 @@
amprocrighttype => 'uuid', amprocnum => '2', amproc => 'uuid_sortsupport' },
{ amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
amprocrighttype => 'uuid', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
+ amprocrighttype => 'uuid', amprocnum => '6', amproc => 'uuid_skipsupport' },
{ amprocfamily => 'btree/record_ops', amproclefttype => 'record',
amprocrighttype => 'record', amprocnum => '1', amproc => 'btrecordcmp' },
{ amprocfamily => 'btree/record_image_ops', amproclefttype => 'record',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 85f42be1b..17b089fa3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1004,18 +1004,27 @@
{ oid => '3129', descr => 'sort support',
proname => 'btint2sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint2sortsupport' },
+{ oid => '9290', descr => 'skip support',
+ proname => 'btint2skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint2skipsupport' },
{ oid => '351', descr => 'less-equal-greater',
proname => 'btint4cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'int4 int4', prosrc => 'btint4cmp' },
{ oid => '3130', descr => 'sort support',
proname => 'btint4sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint4sortsupport' },
+{ oid => '9291', descr => 'skip support',
+ proname => 'btint4skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint4skipsupport' },
{ oid => '842', descr => 'less-equal-greater',
proname => 'btint8cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'int8 int8', prosrc => 'btint8cmp' },
{ oid => '3131', descr => 'sort support',
proname => 'btint8sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint8sortsupport' },
+{ oid => '9292', descr => 'skip support',
+ proname => 'btint8skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint8skipsupport' },
{ oid => '354', descr => 'less-equal-greater',
proname => 'btfloat4cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'float4 float4', prosrc => 'btfloat4cmp' },
@@ -1034,12 +1043,18 @@
{ oid => '3134', descr => 'sort support',
proname => 'btoidsortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btoidsortsupport' },
+{ oid => '9293', descr => 'skip support',
+ proname => 'btoidskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btoidskipsupport' },
{ oid => '404', descr => 'less-equal-greater',
proname => 'btoidvectorcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'oidvector oidvector', prosrc => 'btoidvectorcmp' },
{ oid => '358', descr => 'less-equal-greater',
proname => 'btcharcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'char char', prosrc => 'btcharcmp' },
+{ oid => '9294', descr => 'skip support',
+ proname => 'btcharskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btcharskipsupport' },
{ oid => '359', descr => 'less-equal-greater',
proname => 'btnamecmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'name name', prosrc => 'btnamecmp' },
@@ -2214,6 +2229,9 @@
{ oid => '3136', descr => 'sort support',
proname => 'date_sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'date_sortsupport' },
+{ oid => '9295', descr => 'skip support',
+ proname => 'date_skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'date_skipsupport' },
{ oid => '4133', descr => 'window RANGE support',
proname => 'in_range', prorettype => 'bool',
proargtypes => 'date date interval bool bool',
@@ -4401,6 +4419,9 @@
{ oid => '1693', descr => 'less-equal-greater',
proname => 'btboolcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'bool bool', prosrc => 'btboolcmp' },
+{ oid => '9296', descr => 'skip support',
+ proname => 'btboolskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btboolskipsupport' },
{ oid => '1688', descr => 'hash',
proname => 'time_hash', prorettype => 'int4', proargtypes => 'time',
@@ -9239,6 +9260,9 @@
{ oid => '3300', descr => 'sort support',
proname => 'uuid_sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'uuid_sortsupport' },
+{ oid => '9297', descr => 'skip support',
+ proname => 'uuid_skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'uuid_skipsupport' },
{ oid => '2961', descr => 'I/O',
proname => 'uuid_recv', prorettype => 'uuid', proargtypes => 'internal',
prosrc => 'uuid_recv' },
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d70e6d37e..5b58739ad 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -192,6 +192,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_LOCK_MANAGER,
LWTRANCHE_PREDICATE_LOCK_MANAGER,
LWTRANCHE_PARALLEL_HASH_JOIN,
+ LWTRANCHE_PARALLEL_BTREE_SCAN,
LWTRANCHE_PARALLEL_QUERY_DSA,
LWTRANCHE_PER_SESSION_DSA,
LWTRANCHE_PER_SESSION_RECORD_TYPE,
diff --git a/src/include/utils/skipsupport.h b/src/include/utils/skipsupport.h
new file mode 100644
index 000000000..d91390fc6
--- /dev/null
+++ b/src/include/utils/skipsupport.h
@@ -0,0 +1,109 @@
+/*-------------------------------------------------------------------------
+ *
+ * skipsupport.h
+ * Support routines for B-Tree skip scan.
+ *
+ * B-Tree operator classes for discrete types can optionally provide a support
+ * function for skipping. This is used during skip scans.
+ *
+ * A B-tree operator class that implements skip support provides B-tree index
+ * scans with a way of enumerating and iterating through every possible value
+ * from the domain of indexable values. This gives scans a way to determine
+ * the next value in line for a given skip array/scan key/skipped attribute.
+ * This happens at the point where the scan determines that another primitive
+ * index scan is required. The next value is used (in combination with at
+ * least one additional lower-order non-skip key, taken from the SQL query) to
+ * relocate the scan, skipping over many irrelevant leaf pages in the process.
+ *
+ * Skip support generally works best with discrete types such as integer,
+ * date, and boolean; types where there is a decent chance that indexes will
+ * contain contiguous values (given a leading attribute using the opclass).
+ * When gaps/discontinuities are naturally rare (e.g., a leading identity
+ * column in a composite index, a date column preceding a product_id column),
+ * then it makes sense for skip scans to optimistically assume that the next
+ * distinct indexable value will find directly matching index tuples.
+ *
+ * The B-Tree code can fall back on next-key sentinel values for any opclass
+ * that doesn't provide its own skip support function. There is no point in
+ * providing skip support unless the next indexed key value is often the next
+ * indexable value (at least with some workloads). Opclasses where that never
+ * works out in practice should just rely on the B-Tree AM's generic next-key
+ * fallback strategy. Opclasses where adding skip support is infeasible or
+ * hard (e.g., an opclass for a continuous type) can also use the fallback.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/skipsupport.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SKIPSUPPORT_H
+#define SKIPSUPPORT_H
+
+#include "utils/relcache.h"
+
+typedef struct SkipSupportData *SkipSupport;
+typedef Datum (*SkipSupportIncDec) (Relation rel,
+ Datum existing,
+ bool *overflow);
+
+/*
+ * State/callbacks used by skip arrays to procedurally generate elements.
+ *
+ * A BTSKIPSUPPORT_PROC function must set each and every field when called.
+ * If an opclass can only set some of the fields, then it cannot safely
+ * provide a skip support routine.
+ */
+typedef struct SkipSupportData
+{
+ /*
+ * low_elem and high_elem must be set with the lowest and highest possible
+ * values from the domain of indexable values (assuming standard ascending
+ * order). This helps the B-Tree code with finding its initial position
+ * at the leaf level (during the skip scan's first primitive index scan).
+ * In other words, it gives the B-Tree code a useful value to start from,
+ * before any data has been read from the index.
+ *
+ * low_elem and high_elem are also used by skip scans to determine when
+ * they've reached the final possible value (in the current direction).
+ * It's typical for the scan to run out of leaf pages before it runs out
+ * of unscanned indexable values, but it's still useful for the scan to
+ * have a way to recognize when it has reached the last possible value
+ * (this saves us a useless probe that just lands on the final leaf page).
+ */
+ Datum low_elem; /* lowest sorting/leftmost non-NULL value */
+ Datum high_elem; /* highest sorting/rightmost non-NULL value */
+
+ /*
+ * Decrement/increment functions.
+ *
+ * Returns a decremented/incremented copy of caller's existing datum,
+ * allocated in caller's memory context (in the case of pass-by-reference
+ * types). It's not okay for these functions to leak any memory.
+ *
+ * Both decrement and increment callbacks are guaranteed to never be
+ * called with a NULL "existing" arg.
+ *
+ * When the decrement function (or increment function) is called with a
+ * value that already matches low_elem (or high_elem), function must set
+ * the *overflow argument. The return value is undefined, and the B-Tree
+ * code is entitled to assume that no memory will have been allocated.
+ *
+ * The B-Tree skip scan caller's "existing" datum is often just a straight
+ * copy of a value from an index tuple. Operator classes must be liberal
+ * in accepting every possible representational variation within the
+ * underlying data type. On the other hand, opclasses are _not_ expected
+ * to preserve any information that doesn't affect how datums are sorted
+ * (e.g., skip support for a fixed precision numeric type isn't required
+ * to preserve datum display scale).
+ */
+ SkipSupportIncDec decrement;
+ SkipSupportIncDec increment;
+} SkipSupportData;
+
+extern bool PrepareSkipSupportFromOpclass(Oid opfamily, Oid opcintype,
+ bool reverse, SkipSupport sksup);
+
+#endif /* SKIPSUPPORT_H */
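The increment/decrement contract documented in skipsupport.h can be illustrated with a small standalone sketch. This is not code from the patch: it avoids PostgreSQL headers entirely, and every Demo*/demo_* name is a hypothetical stand-in (plain int16_t replaces Datum, and the Relation argument is dropped). It only shows the shape of the contract: callbacks report overflow once the domain's boundary value is reached, and a caller can walk the domain until that happens.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for SkipSupportIncDec (no Relation, no Datum) */
typedef int16_t (*demo_incdec) (int16_t existing, bool *overflow);

/* Hypothetical stand-in for SkipSupportData */
typedef struct DemoSkipSupport
{
	int16_t		low_elem;		/* lowest value in the domain */
	int16_t		high_elem;		/* highest value in the domain */
	demo_incdec decrement;
	demo_incdec increment;
} DemoSkipSupport;

static int16_t
demo_int2_decrement(int16_t existing, bool *underflow)
{
	if (existing == INT16_MIN)
	{
		/* Already at low_elem: report it, return value is undefined */
		*underflow = true;
		return 0;
	}

	*underflow = false;
	return (int16_t) (existing - 1);
}

static int16_t
demo_int2_increment(int16_t existing, bool *overflow)
{
	if (existing == INT16_MAX)
	{
		/* Already at high_elem: report it, return value is undefined */
		*overflow = true;
		return 0;
	}

	*overflow = false;
	return (int16_t) (existing + 1);
}

/* Sets each and every field, as the header comments require */
static void
demo_int2_skipsupport(DemoSkipSupport *sksup)
{
	sksup->decrement = demo_int2_decrement;
	sksup->increment = demo_int2_increment;
	sksup->low_elem = INT16_MIN;
	sksup->high_elem = INT16_MAX;
}

/* Count the values that remain after 'start' in the forward direction */
static long
demo_count_remaining(const DemoSkipSupport *sksup, int16_t start)
{
	long		n = 0;
	bool		overflow = false;
	int16_t		cur = start;

	assert(sksup->increment != NULL);
	for (;;)
	{
		cur = sksup->increment(cur, &overflow);
		if (overflow)
			break;				/* reached the final possible value */
		n++;
	}
	return n;
}
```

A caller in this sketch stops as soon as overflow is reported, which mirrors how the header describes a scan recognizing that it has reached the last possible value in the current direction.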
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index dcd04b813..dc99dad29 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -470,7 +470,8 @@ index_parallelscan_estimate(Relation indexRelation, int nkeys, int norderbys,
*/
if (indexRelation->rd_indam->amestimateparallelscan != NULL)
nbytes = add_size(nbytes,
- indexRelation->rd_indam->amestimateparallelscan(nkeys,
+ indexRelation->rd_indam->amestimateparallelscan(indexRelation,
+ nkeys,
norderbys));
return nbytes;
diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c
index 1c72867c8..deb387453 100644
--- a/src/backend/access/nbtree/nbtcompare.c
+++ b/src/backend/access/nbtree/nbtcompare.c
@@ -58,6 +58,7 @@
#include <limits.h>
#include "utils/fmgrprotos.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
#ifdef STRESS_SORT_INT_MIN
@@ -78,6 +79,49 @@ btboolcmp(PG_FUNCTION_ARGS)
PG_RETURN_INT32((int32) a - (int32) b);
}
+static Datum
+bool_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ bool bexisting = DatumGetBool(existing);
+
+ if (bexisting == false)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return BoolGetDatum(bexisting - 1);
+}
+
+static Datum
+bool_increment(Relation rel, Datum existing, bool *overflow)
+{
+ bool bexisting = DatumGetBool(existing);
+
+ if (bexisting == true)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return BoolGetDatum(bexisting + 1);
+}
+
+Datum
+btboolskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = bool_decrement;
+ sksup->increment = bool_increment;
+ sksup->low_elem = BoolGetDatum(false);
+ sksup->high_elem = BoolGetDatum(true);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint2cmp(PG_FUNCTION_ARGS)
{
@@ -105,6 +149,49 @@ btint2sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int2_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ int16 iexisting = DatumGetInt16(existing);
+
+ if (iexisting == PG_INT16_MIN)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return Int16GetDatum(iexisting - 1);
+}
+
+static Datum
+int2_increment(Relation rel, Datum existing, bool *overflow)
+{
+ int16 iexisting = DatumGetInt16(existing);
+
+ if (iexisting == PG_INT16_MAX)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return Int16GetDatum(iexisting + 1);
+}
+
+Datum
+btint2skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int2_decrement;
+ sksup->increment = int2_increment;
+ sksup->low_elem = Int16GetDatum(PG_INT16_MIN);
+ sksup->high_elem = Int16GetDatum(PG_INT16_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint4cmp(PG_FUNCTION_ARGS)
{
@@ -128,6 +215,49 @@ btint4sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int4_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ int32 iexisting = DatumGetInt32(existing);
+
+ if (iexisting == PG_INT32_MIN)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return Int32GetDatum(iexisting - 1);
+}
+
+static Datum
+int4_increment(Relation rel, Datum existing, bool *overflow)
+{
+ int32 iexisting = DatumGetInt32(existing);
+
+ if (iexisting == PG_INT32_MAX)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return Int32GetDatum(iexisting + 1);
+}
+
+Datum
+btint4skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int4_decrement;
+ sksup->increment = int4_increment;
+ sksup->low_elem = Int32GetDatum(PG_INT32_MIN);
+ sksup->high_elem = Int32GetDatum(PG_INT32_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint8cmp(PG_FUNCTION_ARGS)
{
@@ -171,6 +301,49 @@ btint8sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int8_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ int64 iexisting = DatumGetInt64(existing);
+
+ if (iexisting == PG_INT64_MIN)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return Int64GetDatum(iexisting - 1);
+}
+
+static Datum
+int8_increment(Relation rel, Datum existing, bool *overflow)
+{
+ int64 iexisting = DatumGetInt64(existing);
+
+ if (iexisting == PG_INT64_MAX)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return Int64GetDatum(iexisting + 1);
+}
+
+Datum
+btint8skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int8_decrement;
+ sksup->increment = int8_increment;
+ sksup->low_elem = Int64GetDatum(PG_INT64_MIN);
+ sksup->high_elem = Int64GetDatum(PG_INT64_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint48cmp(PG_FUNCTION_ARGS)
{
@@ -292,6 +465,49 @@ btoidsortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+oid_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ Oid oexisting = DatumGetObjectId(existing);
+
+ if (oexisting == InvalidOid)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return ObjectIdGetDatum(oexisting - 1);
+}
+
+static Datum
+oid_increment(Relation rel, Datum existing, bool *overflow)
+{
+ Oid oexisting = DatumGetObjectId(existing);
+
+ if (oexisting == OID_MAX)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return ObjectIdGetDatum(oexisting + 1);
+}
+
+Datum
+btoidskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = oid_decrement;
+ sksup->increment = oid_increment;
+ sksup->low_elem = ObjectIdGetDatum(InvalidOid);
+ sksup->high_elem = ObjectIdGetDatum(OID_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btoidvectorcmp(PG_FUNCTION_ARGS)
{
@@ -325,3 +541,48 @@ btcharcmp(PG_FUNCTION_ARGS)
/* Be careful to compare chars as unsigned */
PG_RETURN_INT32((int32) ((uint8) a) - (int32) ((uint8) b));
}
+
+static Datum
+char_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ uint8 cexisting = DatumGetUInt8(existing);
+
+ if (cexisting == 0)
+ {
+ *underflow = true;
+ return 0;
+ }
+
+ *underflow = false;
+ return CharGetDatum((uint8) cexisting - 1);
+}
+
+static Datum
+char_increment(Relation rel, Datum existing, bool *overflow)
+{
+ uint8 cexisting = DatumGetUInt8(existing);
+
+ if (cexisting == UCHAR_MAX)
+ {
+ *overflow = true;
+ return 0;
+ }
+
+ *overflow = false;
+ return CharGetDatum((uint8) cexisting + 1);
+}
+
+Datum
+btcharskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = char_decrement;
+ sksup->increment = char_increment;
+
+ /* btcharcmp compares chars as unsigned */
+ sksup->low_elem = UInt8GetDatum(0);
+ sksup->high_elem = UInt8GetDatum(UCHAR_MAX);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 689028dc5..d318a1d88 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -32,6 +32,7 @@
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
+#include "utils/datum.h"
#include "utils/fmgrprotos.h"
#include "utils/index_selfuncs.h"
#include "utils/memutils.h"
@@ -71,7 +72,7 @@ typedef struct BTParallelScanDescData
* available for scan. see above for
* possible states of parallel scan. */
uint64 btps_nsearches; /* instrumentation */
- slock_t btps_mutex; /* protects above variables, btps_arrElems */
+ LWLock btps_lock; /* protects above variables, btps_arrElems */
ConditionVariable btps_cv; /* used to synchronize parallel scan */
/*
@@ -79,11 +80,21 @@ typedef struct BTParallelScanDescData
* index scan. Holds BTArrayKeyInfo.cur_elem offsets for scan keys.
*/
int btps_arrElems[FLEXIBLE_ARRAY_MEMBER];
+
+ /*
+ * The rest of the space allocated in shared memory is also used when
+ * scans need to schedule another primitive index scan. It holds a
+ * flattened representation of the backend's skip array datums, if any.
+ */
} BTParallelScanDescData;
typedef struct BTParallelScanDescData *BTParallelScanDesc;
+static void _bt_parallel_serialize_arrays(Relation rel, BTParallelScanDesc btscan,
+ BTScanOpaque so);
+static void _bt_parallel_restore_arrays(Relation rel, BTParallelScanDesc btscan,
+ BTScanOpaque so);
static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
BTCycleId cycleid);
@@ -539,10 +550,155 @@ btrestrpos(IndexScanDesc scan)
* btestimateparallelscan -- estimate storage for BTParallelScanDescData
*/
Size
-btestimateparallelscan(int nkeys, int norderbys)
+btestimateparallelscan(Relation rel, int nkeys, int norderbys)
{
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ Size estnbtreeshared,
+ genericattrspace;
+
/* Pessimistically assume all input scankeys will be output with arrays */
- return offsetof(BTParallelScanDescData, btps_arrElems) + sizeof(int) * nkeys;
+ estnbtreeshared = offsetof(BTParallelScanDescData, btps_arrElems) +
+ sizeof(int) * nkeys;
+
+ /*
+ * Assume every index attribute might require that we generate a skip scan
+ * key
+ */
+ genericattrspace = datumEstimateSpace((Datum) 0, false, true,
+ sizeof(Datum));
+ for (int attnum = 1; attnum <= nkeyatts; attnum++)
+ {
+ Form_pg_attribute attr;
+
+ /* Every skip array needs a space for storing sk_flags */
+ estnbtreeshared = add_size(estnbtreeshared, sizeof(int));
+
+ attr = TupleDescAttr(rel->rd_att, attnum - 1);
+ if (attr->attlen > 0)
+ {
+ /* Fixed length datum */
+ Size estfixed = datumEstimateSpace((Datum) 0, false,
+ attr->attbyval,
+ attr->attlen);
+
+ estnbtreeshared = add_size(estnbtreeshared, estfixed);
+ continue;
+ }
+
+ /*
+ * Varlena (or other variable-length) datum.
+ *
+ * Assume that serializing the arrays will use just as much space as a
+ * pass-by-value datum, in addition to a full third of a page of
+ * space.
+ */
+ estnbtreeshared = add_size(estnbtreeshared, genericattrspace);
+ estnbtreeshared = add_size(estnbtreeshared, BLCKSZ / 3);
+ }
+
+ return estnbtreeshared;
+}
+
+/*
+ * _bt_parallel_serialize_arrays() -- Serialize parallel array state.
+ *
+ * Caller must have exclusively locked btscan->btps_lock when called.
+ */
+static void
+_bt_parallel_serialize_arrays(Relation rel, BTParallelScanDesc btscan,
+ BTScanOpaque so)
+{
+ char *datumshared;
+
+ /* Space for serialized datums begins immediately after btps_arrElems[] */
+ datumshared = ((char *) &btscan->btps_arrElems[so->numArrayKeys]);
+ for (int i = 0; i < so->numArrayKeys; i++)
+ {
+ BTArrayKeyInfo *array = &so->arrayKeys[i];
+ ScanKey skey = &so->keyData[array->scan_key];
+ Form_pg_attribute attr;
+
+ if (array->num_elems != -1)
+ {
+ /* Serialize regular (non-skip) array */
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+ btscan->btps_arrElems[i] = array->cur_elem;
+ continue;
+ }
+
+ /* Serialize skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ memcpy(datumshared, &skey->sk_flags, sizeof(int));
+ datumshared += sizeof(int);
+
+ if (skey->sk_flags & SK_BT_NEGPOSINF)
+ {
+ /* No sk_argument to serialize */
+ Assert(skey->sk_argument == 0);
+ continue;
+ }
+
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ datumSerialize(skey->sk_argument, (skey->sk_flags & SK_ISNULL) != 0,
+ attr->attbyval, attr->attlen, &datumshared);
+ }
+}
+
+/*
+ * _bt_parallel_restore_arrays() -- Restore serialized parallel array state.
+ *
+ * Caller must have exclusively locked btscan->btps_lock when called.
+ */
+static void
+_bt_parallel_restore_arrays(Relation rel, BTParallelScanDesc btscan,
+ BTScanOpaque so)
+{
+ char *datumshared;
+
+ /* Space for serialized datums begins immediately after btps_arrElems[] */
+ datumshared = ((char *) &btscan->btps_arrElems[so->numArrayKeys]);
+ for (int i = 0; i < so->numArrayKeys; i++)
+ {
+ BTArrayKeyInfo *array = &so->arrayKeys[i];
+ ScanKey skey = &so->keyData[array->scan_key];
+ bool isnull;
+ Form_pg_attribute attr;
+
+ if (array->num_elems != -1)
+ {
+ /* Restore regular (non-skip) array */
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+ array->cur_elem = btscan->btps_arrElems[i];
+ skey->sk_argument = array->elem_values[array->cur_elem];
+ continue;
+ }
+
+ /* Restore skip array */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+ skey->sk_argument = (Datum) 0;
+
+ /* Now that old sk_argument memory is freed, copy over sk_flags */
+ memcpy(&skey->sk_flags, datumshared, sizeof(int));
+ datumshared += sizeof(int);
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+
+ if (skey->sk_flags & SK_BT_NEGPOSINF)
+ {
+ /* No sk_argument to restore */
+ continue;
+ }
+
+ skey->sk_argument = datumRestore(&datumshared, &isnull);
+ if (isnull)
+ {
+ Assert(skey->sk_argument == 0);
+ Assert(skey->sk_flags & SK_SEARCHNULL);
+ Assert(skey->sk_flags & SK_ISNULL);
+ }
+ }
}
/*
@@ -553,7 +709,8 @@ btinitparallelscan(void *target)
{
BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
- SpinLockInit(&bt_target->btps_mutex);
+ LWLockInitialize(&bt_target->btps_lock,
+ LWTRANCHE_PARALLEL_BTREE_SCAN);
bt_target->btps_scanPage = InvalidBlockNumber;
bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
bt_target->btps_nsearches = 0;
@@ -575,15 +732,15 @@ btparallelrescan(IndexScanDesc scan)
parallel_scan->ps_offset);
/*
- * In theory, we don't need to acquire the spinlock here, because there
+ * In theory, we don't need to acquire the LWLock here, because there
* shouldn't be any other workers running at this point, but we do so for
* consistency.
*/
- SpinLockAcquire(&btscan->btps_mutex);
+ LWLockAcquire(&btscan->btps_lock, LW_EXCLUSIVE);
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
/* deliberately don't reset btps_nsearches (matches index_rescan) */
- SpinLockRelease(&btscan->btps_mutex);
+ LWLockRelease(&btscan->btps_lock);
}
/*
@@ -611,6 +768,7 @@ btparallelrescan(IndexScanDesc scan)
bool
_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool exit_loop = false;
bool status = true;
@@ -650,7 +808,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
while (1)
{
- SpinLockAcquire(&btscan->btps_mutex);
+ LWLockAcquire(&btscan->btps_lock, LW_EXCLUSIVE);
if (btscan->btps_pageStatus == BTPARALLEL_DONE)
{
@@ -668,14 +826,10 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
if (first)
{
btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
- for (int i = 0; i < so->numArrayKeys; i++)
- {
- BTArrayKeyInfo *array = &so->arrayKeys[i];
- ScanKey skey = &so->keyData[array->scan_key];
- array->cur_elem = btscan->btps_arrElems[i];
- skey->sk_argument = array->elem_values[array->cur_elem];
- }
+ /* Restore scan's array keys from serialized values */
+ _bt_parallel_restore_arrays(rel, btscan, so);
+
so->needPrimScan = true;
so->scanBehind = false;
so->oppoDirCheck = false;
@@ -698,7 +852,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
*pageno = btscan->btps_scanPage;
exit_loop = true;
}
- SpinLockRelease(&btscan->btps_mutex);
+ LWLockRelease(&btscan->btps_lock);
if (exit_loop || !status)
break;
ConditionVariableSleep(&btscan->btps_cv, WAIT_EVENT_BTREE_PAGE);
@@ -728,10 +882,10 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
- SpinLockAcquire(&btscan->btps_mutex);
+ LWLockAcquire(&btscan->btps_lock, LW_EXCLUSIVE);
btscan->btps_scanPage = scan_page;
btscan->btps_pageStatus = BTPARALLEL_IDLE;
- SpinLockRelease(&btscan->btps_mutex);
+ LWLockRelease(&btscan->btps_lock);
ConditionVariableSignal(&btscan->btps_cv);
}
@@ -745,6 +899,7 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
void
_bt_parallel_done(IndexScanDesc scan)
{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
bool status_changed = false;
@@ -753,6 +908,17 @@ _bt_parallel_done(IndexScanDesc scan)
if (parallel_scan == NULL)
return;
+ /*
+ * Defensively disallow marking parallel scan done when this backend has a
+ * pending primitive index scan
+ *
+ * XXX Might be better to remove the call here made by _bt_first right
+ * after _bt_endpoint is called...since we don't have a similar call after
+ * _bt_search is called.
+ */
+ if (so->needPrimScan)
+ return;
+
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
@@ -760,7 +926,7 @@ _bt_parallel_done(IndexScanDesc scan)
* Mark the parallel scan as done, unless some other process did so
* already
*/
- SpinLockAcquire(&btscan->btps_mutex);
+ LWLockAcquire(&btscan->btps_lock, LW_EXCLUSIVE);
if (btscan->btps_pageStatus != BTPARALLEL_DONE)
{
btscan->btps_pageStatus = BTPARALLEL_DONE;
@@ -768,7 +934,7 @@ _bt_parallel_done(IndexScanDesc scan)
}
/* Copy the authoritative shared primitive scan counter to local field */
scan->nsearches = btscan->btps_nsearches;
- SpinLockRelease(&btscan->btps_mutex);
+ LWLockRelease(&btscan->btps_lock);
/* wake up all the workers associated with this parallel scan */
if (status_changed)
@@ -786,6 +952,7 @@ _bt_parallel_done(IndexScanDesc scan)
void
_bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
@@ -795,7 +962,7 @@ _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
- SpinLockAcquire(&btscan->btps_mutex);
+ LWLockAcquire(&btscan->btps_lock, LW_EXCLUSIVE);
if (btscan->btps_scanPage == prev_scan_page &&
btscan->btps_pageStatus == BTPARALLEL_IDLE)
{
@@ -804,14 +971,9 @@ _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
btscan->btps_nsearches++;
/* Serialize scan's current array keys */
- for (int i = 0; i < so->numArrayKeys; i++)
- {
- BTArrayKeyInfo *array = &so->arrayKeys[i];
-
- btscan->btps_arrElems[i] = array->cur_elem;
- }
+ _bt_parallel_serialize_arrays(rel, btscan, so);
}
- SpinLockRelease(&btscan->btps_mutex);
+ LWLockRelease(&btscan->btps_lock);
}
/*
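The flattened layout that _bt_parallel_serialize_arrays and _bt_parallel_restore_arrays agree on (an int of flags per skip array, followed by a value unless the flags say there is none) can be sketched in isolation. The following is a simplified model rather than the patch's code: DemoSkipKey, DEMO_NEGPOSINF, and the demo_* functions are hypothetical names, a fixed-width int64_t replaces the datumSerialize/datumRestore encoding, and no locking is shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NEGPOSINF 0x1		/* hypothetical flag: no value follows */

/* Hypothetical stand-in for the relevant parts of a skip array scan key */
typedef struct DemoSkipKey
{
	int			flags;
	int64_t		argument;
} DemoSkipKey;

/* Write each key as flags, then (unless NEGPOSINF) its value */
static size_t
demo_serialize(const DemoSkipKey *keys, int nkeys, char *buf)
{
	char	   *p = buf;

	assert(nkeys >= 0);
	for (int i = 0; i < nkeys; i++)
	{
		memcpy(p, &keys[i].flags, sizeof(int));
		p += sizeof(int);

		if (keys[i].flags & DEMO_NEGPOSINF)
			continue;			/* no argument to serialize */

		memcpy(p, &keys[i].argument, sizeof(int64_t));
		p += sizeof(int64_t);
	}
	return (size_t) (p - buf);
}

/* Read keys back in the same order, consuming exactly what was written */
static size_t
demo_restore(DemoSkipKey *keys, int nkeys, const char *buf)
{
	const char *p = buf;

	assert(nkeys >= 0);
	for (int i = 0; i < nkeys; i++)
	{
		keys[i].argument = 0;	/* discard any old value first */
		memcpy(&keys[i].flags, p, sizeof(int));
		p += sizeof(int);

		if (keys[i].flags & DEMO_NEGPOSINF)
			continue;			/* no argument to restore */

		memcpy(&keys[i].argument, p, sizeof(int64_t));
		p += sizeof(int64_t);
	}
	return (size_t) (p - buf);
}
```

Because the reader advances through the buffer by inspecting the flags it has just restored, both sides stay in step only if they apply the same "value present?" rule, which is why the flags are written ahead of the value.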
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index e5f941e0a..ed5593c62 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -883,7 +883,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Buffer buf;
BTStack stack;
OffsetNumber offnum;
- StrategyNumber strat;
BTScanInsertData inskey;
ScanKey startKeys[INDEX_MAX_KEYS];
ScanKeyData notnullkeys[INDEX_MAX_KEYS];
@@ -975,7 +974,20 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* a > or < boundary or find an attribute with no boundary (which can be
* thought of as the same as "> -infinity"), we can't use keys for any
* attributes to its right, because it would break our simplistic notion
- * of what initial positioning strategy to use.
+ * of what initial positioning strategy to use. In practice skip scan
+ * typically enables us to use all scan keys here, even with a set of
+ * input keys that leave a "gap" between two index attributes (cases with
+ * multiple gaps will even manage this without any special restrictions).
+ *
+ * Skip scan works by having _bt_preprocess_keys cons up = boundary keys
+ * for any index columns that were missing a = key in scan->keyData[], the
+ * input scan keys passed to us by the executor. This happens for index
+ * attributes prior to the attribute of our final input scan key. The
+ * underlying = keys use skip arrays. The keys can be thought of as the
+ * same as "col = ANY('{every possible col value}')". Note that this
+ * often includes the array element NULL, which the scan will treat as an
+ * IS NULL qual (the skip array's scan key is already marked SK_SEARCHNULL
+ * when we're called, so we need no special handling for this case here).
*
* When the scan keys include cross-type operators, _bt_preprocess_keys
* may not be able to eliminate redundant keys; in such cases we will
@@ -1050,6 +1062,47 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
{
if (i >= so->numberOfKeys || cur->sk_attno != curattr)
{
+ if (chosen && (chosen->sk_flags & SK_BT_NEGPOSINF))
+ {
+ /* -inf/+inf element from a skip array's scan key */
+ ScanKey origchosen = chosen;
+ BTArrayKeyInfo *array = NULL;
+
+ for (int arridx = 0; arridx < so->numArrayKeys; arridx++)
+ {
+ array = &so->arrayKeys[arridx];
+ if (array->scan_key == chosen - so->keyData)
+ break;
+ }
+
+ /* use array's inequality key in startKeys[] */
+ if (ScanDirectionIsForward(dir))
+ chosen = array->low_compare;
+ else
+ chosen = array->high_compare;
+
+ Assert(!chosen ||
+ chosen->sk_attno == origchosen->sk_attno);
+
+ if (!array->null_elem)
+ {
+ /*
+ * The array does not include a NULL element (meaning
+ * array advancement never generates an IS NULL qual).
+ * We'll deduce a NOT NULL key to skip over any NULLs
+ * when there's no usable low_compare (or no usable
+ * high_compare, during a backwards scan).
+ *
+ * Note: this also handles an explicit NOT NULL key
+ * that preprocessing folded into the skip array (such
+ * keys aren't saved in low_compare/high_compare).
+ */
+ impliesNN = origchosen;
+ }
+ else
+ Assert(chosen == NULL && impliesNN == NULL);
+ }
+
/*
* Done looking at keys for curattr. If we didn't find a
* usable boundary key, see if we can deduce a NOT NULL key.
@@ -1083,16 +1136,42 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
break;
startKeys[keysz++] = chosen;
+ if (chosen->sk_flags & SK_BT_NEXTPRIOR)
+ {
+ /*
+ * Next/prior key element from a skip array's scan key.
+ * 'chosen' could be SK_ISNULL, in which case startKeys[]
+ * positions us at the first tuple > NULL (for backwards
+ * scans it's the first tuple < NULL instead).
+ *
+ * Adjust strat_total, so that our = key gets treated like
+ * a > key (or like a < key) within _bt_search.
+ */
+ Assert(strat_total == BTEqualStrategyNumber);
+ if (ScanDirectionIsForward(dir))
+ strat_total = BTGreaterStrategyNumber;
+ else
+ strat_total = BTLessStrategyNumber;
+
+ /*
+ * We'll never find an exact = match for a NEXTPRIOR
+ * sentinel sk_argument value, so there's no reason to
+ * save any later would-be boundary keys in startKeys[]
+ * (besides, doing so would confuse _bt_search, since it
+ * isn't directly aware of NEXTPRIOR sentinel values)
+ */
+ break;
+ }
+
/*
* Adjust strat_total, and quit if we have stored a > or <
* key.
*/
- strat = chosen->sk_strategy;
- if (strat != BTEqualStrategyNumber)
+ if (chosen->sk_strategy != BTEqualStrategyNumber)
{
- strat_total = strat;
- if (strat == BTGreaterStrategyNumber ||
- strat == BTLessStrategyNumber)
+ strat_total = chosen->sk_strategy;
+ if (chosen->sk_strategy == BTGreaterStrategyNumber ||
+ chosen->sk_strategy == BTLessStrategyNumber)
break;
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index d1423bd85..5a7b1ace4 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -29,9 +29,37 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+/*
+ * GUC parameters (temporary convenience for reviewers).
+ *
+ * To disable all skipping, set skipscan_prefix_cols=0. Otherwise set it to
+ * the attribute number that you wish to make the last attribute number that
+ * we can add a skip scan key for. For example, skipscan_prefix_cols=1 makes
+ * an index scan with qual "WHERE b = 1 AND c > 42" generate a skip scan key
+ * on the column 'a' (which is attnum 1) only, preventing us from adding one
+ * for the column 'c' (and so 'c' will still have an inequality scan key,
+ * required in only one direction -- 'c' won't be output as a "range" skip
+ * key/array).
+ */
+int skipscan_prefix_cols = INDEX_MAX_KEYS;
+
+/*
+ * skipscan_skipsupport_enabled can be used to avoid using opclass skip
+ * support routines. This can be used to quantify the performance benefit that
+ * comes from having dedicated skip support, with a given test query.
+ */
+bool skipscan_skipsupport_enabled = true;
+
#define LOOK_AHEAD_REQUIRED_RECHECKS 3
#define LOOK_AHEAD_DEFAULT_DISTANCE 5
+typedef struct BTSkipPreproc
+{
+ SkipSupportData sksup; /* opclass skip scan support (optional) */
+ bool use_sksup; /* sksup set to valid routine? */
+ Oid eq_op; /* InvalidOid means don't skip */
+} BTSkipPreproc;
+
typedef struct BTSortArrayContext
{
FmgrInfo *sortproc;
@@ -64,15 +92,38 @@ static bool _bt_compare_array_scankey_args(IndexScanDesc scan,
bool *qual_ok);
static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys);
static void _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap);
+static int _bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts);
+static bool _bt_skipsupport(Relation rel, int add_skip_attno,
+ BTSkipPreproc *skipatts);
static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
static inline int32 _bt_compare_array_skey(FmgrInfo *orderproc,
Datum tupdatum, bool tupnull,
- Datum arrdatum, ScanKey cur);
+ Datum arrdatum, bool arrnull,
+ ScanKey cur);
+static void _bt_array_preproc_shrink(ScanKey arraysk, ScanKey skey,
+ FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok);
+static bool _bt_skip_preproc_shrink(IndexScanDesc scan, ScanKey arraysk,
+ ScanKey skey, FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok);
static int _bt_binsrch_array_skey(FmgrInfo *orderproc,
bool cur_elem_trig, ScanDirection dir,
Datum tupdatum, bool tupnull,
BTArrayKeyInfo *array, ScanKey cur,
int32 *set_elem_result);
+static void _bt_binsrch_skiparray_skey(FmgrInfo *orderproc,
+ bool cur_elem_trig, ScanDirection dir,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *set_elem_result);
+static void _bt_scankey_set_low_or_high(Relation rel, ScanKey skey,
+ BTArrayKeyInfo *array, bool low_not_high);
+static void _bt_scankey_set_element(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ Datum tupdatum, bool tupnull);
+static void _bt_scankey_unset_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static void _bt_scankey_set_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static bool _bt_scankey_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static bool _bt_scankey_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir);
static void _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir);
static bool _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
@@ -258,11 +309,19 @@ _bt_freestack(BTStack stack)
* preprocessing steps are complete. This will convert the scan key offset
* references into references to the scan's so->keyData[] output scan keys.
*
+ * We're also responsible for generating skip arrays (and their associated
+ * scan keys) here. This enables skip scan. We do this for index attributes
+ * that initially lacked an equality condition within scan->keyData[], iff
+ * doing so allows a later scan key (that was passed to us in scan->keyData[])
+ * to be marked required by later preprocessing on output.
+ * _bt_decide_skipatts decides which attributes receive skip arrays.
+ *
* Caller must pass *numberOfKeys to give us a way to change the number of
* input scan keys (our output is caller's input). The returned array can be
* smaller than scan->keyData[] when we eliminated a redundant array scan key
- * (redundant with some other array scan key, for the same attribute). Caller
- * uses this to allocate so->keyData[] for the current btrescan.
+ * (redundant with some other array scan key, for the same attribute). It can
+ * also be larger when we added a skip array/skip scan key. Caller uses this
+ * to allocate so->keyData[] for the current btrescan.
*
* Note: the reason we need to return a temp scan key array, rather than just
* scribbling on scan->keyData, is that callers are permitted to call btrescan
@@ -275,8 +334,11 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
Relation rel = scan->indexRelation;
int numArrayKeyData = scan->numberOfKeys;
int16 *indoption = rel->rd_indoption;
+ BTSkipPreproc skipatts[INDEX_MAX_KEYS];
int numArrayKeys,
+ numSkipArrayKeys,
output_ikey = 0;
+ AttrNumber attno_skip = 1;
int origarrayatt = InvalidAttrNumber,
origarraykey = -1;
Oid origelemtype = InvalidOid;
@@ -286,7 +348,10 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
Assert(scan->numberOfKeys);
- /* Quick check to see if there are any array keys */
+ /*
+ * Quick check to see if there are any array keys, or any missing keys we
+ * can generate a "skip scan" array key for ourselves
+ */
numArrayKeys = 0;
for (int i = 0; i < scan->numberOfKeys; i++)
{
@@ -304,6 +369,15 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
}
}
+ numSkipArrayKeys = _bt_decide_skipatts(scan, skipatts);
+ if (numSkipArrayKeys)
+ {
+ /* At least one skip array scan key must be added to arrayKeyData[] */
+ numArrayKeys += numSkipArrayKeys;
+ /* output scan key buffer allocation needs space for skip scan keys */
+ numArrayKeyData += numSkipArrayKeys;
+ }
+
/* Quit if nothing to do. */
if (numArrayKeys == 0)
return NULL;
@@ -330,7 +404,12 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
/* Allocate space for ORDER procs used to help _bt_checkkeys */
so->orderProcs = (FmgrInfo *) palloc(numArrayKeyData * sizeof(FmgrInfo));
- /* Now process each array key */
+ /*
+ * Process each array key, and generate skip arrays as needed. Also copy
+ * every scan->keyData[] input scan key (whether it's an array or not)
+ * into the arrayKeyData array we'll return to our caller (barring any
+ * array scan keys that we could eliminate early through array merging).
+ */
numArrayKeys = 0;
for (int input_ikey = 0; input_ikey < scan->numberOfKeys; input_ikey++)
{
@@ -348,8 +427,76 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
int num_nonnulls;
int j;
+ /* Create a skip array and scan key where indicated by skipatts */
+ while (numSkipArrayKeys &&
+ attno_skip <= scan->keyData[input_ikey].sk_attno)
+ {
+ Oid opcintype = rel->rd_opcintype[attno_skip - 1];
+ Oid collation = rel->rd_indcollation[attno_skip - 1];
+ Oid eq_op = skipatts[attno_skip - 1].eq_op;
+ RegProcedure cmp_proc;
+
+ if (!OidIsValid(eq_op))
+ {
+ /* won't skip using this attribute */
+ attno_skip++;
+ continue;
+ }
+
+ cmp_proc = get_opcode(eq_op);
+ if (!RegProcedureIsValid(cmp_proc))
+ elog(ERROR, "missing oprcode for skipping equals operator %u", eq_op);
+
+ cur = &arrayKeyData[output_ikey];
+ Assert(attno_skip <= scan->keyData[input_ikey].sk_attno);
+ ScanKeyEntryInitialize(cur,
+ SK_SEARCHARRAY | SK_BT_SKIP, /* flags */
+ attno_skip, /* skipped att number */
+ BTEqualStrategyNumber, /* equality strategy */
+ InvalidOid, /* opclass input subtype */
+ collation, /* index column's collation */
+ cmp_proc, /* equality operator's proc */
+ (Datum) 0); /* constant */
+
+ /* Initialize array fields */
+ so->arrayKeys[numArrayKeys].scan_key = output_ikey;
+ so->arrayKeys[numArrayKeys].num_elems = -1;
+ so->arrayKeys[numArrayKeys].cur_elem = 0;
+ so->arrayKeys[numArrayKeys].elem_values = NULL; /* unused */
+ so->arrayKeys[numArrayKeys].use_sksup = skipatts[attno_skip - 1].use_sksup;
+ so->arrayKeys[numArrayKeys].null_elem = true; /* for now */
+ so->arrayKeys[numArrayKeys].sksup = skipatts[attno_skip - 1].sksup;
+ so->arrayKeys[numArrayKeys].low_compare = NULL; /* for now */
+ so->arrayKeys[numArrayKeys].high_compare = NULL; /* for now */
+
+ /*
+ * Temporary testing GUC can disable the use of an opclass's skip
+ * support routine
+ */
+ if (!skipscan_skipsupport_enabled)
+ so->arrayKeys[numArrayKeys].use_sksup = false;
+
+ /*
+ * We'll need a 3-way ORDER proc to determine when and how the
+ * consed-up "array" will advance inside _bt_advance_array_keys.
+ * Set one up now.
+ */
+ _bt_setup_array_cmp(scan, cur, opcintype,
+ &so->orderProcs[output_ikey], NULL);
+
+ /*
+ * Prepare to output next scan key (might be another skip scan
+ * key, or it could be an input scan key from scan->keyData[])
+ */
+ numSkipArrayKeys--;
+ numArrayKeys++;
+ attno_skip++;
+ output_ikey++; /* keep this scan key/array */
+ }
+
/*
- * Copy input scan key into temp arrayKeyData scan key array
+ * Copy input scan key into temp arrayKeyData scan key array. (From
+ * here on, cur points at our copy of the input scan key.)
*/
cur = &arrayKeyData[output_ikey];
*cur = scan->keyData[input_ikey];
@@ -521,6 +668,10 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
so->arrayKeys[numArrayKeys].scan_key = output_ikey;
so->arrayKeys[numArrayKeys].num_elems = num_elems;
so->arrayKeys[numArrayKeys].elem_values = elem_values;
+ so->arrayKeys[numArrayKeys].null_elem = false; /* unused */
+ so->arrayKeys[numArrayKeys].use_sksup = false; /* redundant */
+ so->arrayKeys[numArrayKeys].low_compare = NULL; /* unused */
+ so->arrayKeys[numArrayKeys].high_compare = NULL; /* unused */
numArrayKeys++;
output_ikey++; /* keep this scan key/array */
}
@@ -634,7 +785,8 @@ _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap)
{
BTArrayKeyInfo *array = &so->arrayKeys[arrayidx];
- Assert(array->num_elems > 0);
+ Assert(array->num_elems > 0 || array->num_elems == -1);
+ Assert(array->num_elems != -1 || outkey->sk_flags & SK_BT_REQFWD);
if (array->scan_key == input_ikey)
{
@@ -685,7 +837,7 @@ _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap)
* Parallel index scans require space in shared memory to store the
* current array elements (for arrays kept by preprocessing) to schedule
* the next primitive index scan. The underlying structure is protected
- * using a spinlock, so defensively limit its size. In practice this can
+ * using an LWLock, so defensively limit its size. In practice this can
* only affect parallel scans that use an incomplete opfamily.
*/
if (scan->parallel_scan && so->numArrayKeys > INDEX_MAX_KEYS)
@@ -695,6 +847,191 @@ _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap)
so->numArrayKeys, INDEX_MAX_KEYS)));
}
+/*
+ * _bt_decide_skipatts() -- set index attributes requiring skip arrays
+ *
+ * _bt_preprocess_array_keys helper function. Determines which attributes
+ * will require skip arrays/scan keys. Also sets up skip support callbacks
+ * for attributes whose input opclass has skip support (opclasses without
+ * skip support will fall back on using next-key sentinel values when
+ * advancing the skip array to its next array element).
+ *
+ * Return value is the total number of scan keys to add as "input" scan keys
+ * for further processing within _bt_preprocess_keys.
+ */
+static int
+_bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts)
+{
+ Relation rel = scan->indexRelation;
+ ScanKey inputsk;
+ AttrNumber attno_inputsk = 1,
+ attno_skip = 1;
+ bool attno_has_equal = false,
+ attno_has_rowcompare = false;
+ int numSkipArrayKeys = 0;
+
+ Assert(scan->numberOfKeys);
+
+ /*
+ * Only add skip arrays (and associated scan keys) when doing so will
+ * enable _bt_preprocess_keys to mark one or more lower-order input scan
+ * keys (user-visible scan keys taken from scan->keyData[] input array) as
+ * required to continue the scan
+ */
+ inputsk = &scan->keyData[0];
+ for (int i = 0;; inputsk++, i++)
+ {
+ int prev_numSkipArrayKeys = numSkipArrayKeys;
+
+ /*
+ * Backfill skip arrays for any wholly omitted attributes prior to
+ * attno_inputsk
+ */
+ while (attno_skip < attno_inputsk)
+ {
+ if (!_bt_skipsupport(rel, attno_skip, &skipatts[attno_skip - 1]))
+ {
+ /*
+ * Cannot generate a skip array for this or later attributes
+ * (input opclass lacks an equality strategy operator)
+ */
+ return prev_numSkipArrayKeys;
+ }
+
+ /* plan on adding a backfill skip array for this attribute */
+ numSkipArrayKeys++;
+ attno_skip++;
+ }
+
+ /*
+ * Stop once past the final input scan key. We deliberately never add
+ * a skip attribute for the attribute of the last input scan key.
+ *
+ * If the last input scan key(s) use equality strategy, then a skip
+ * attribute is superfluous at best. If the last input scan key uses
+ * an inequality strategy, then adding a skip scan array/scan key is a
+ * valid though suboptimal transformation. It is better to arrange
+ * for preprocessing to allow such an input inequality scan key to
+ * remain an inequality on output. That way _bt_checkkeys will be
+ * able to make best use of both of its precheck optimizations, but
+ * _bt_first will be no less capable of efficiently finding the
+ * starting position for each primitive index scan.
+ */
+ if (i >= scan->numberOfKeys)
+ break;
+
+ /*
+ * Cannot keep adding skip arrays after a RowCompare
+ */
+ if (attno_has_rowcompare)
+ break;
+
+ /*
+ * Apply temporary testing GUC that can be used to disable skipping
+ * (either in part or in whole)
+ */
+ if (attno_inputsk > skipscan_prefix_cols)
+ break;
+
+ /*
+ * Now consider next attno_inputsk (or keep going if this is an
+ * additional scan key against the same attribute)
+ */
+ if (attno_inputsk < inputsk->sk_attno)
+ {
+ /*
+ * Now add skip array for previous scan key's attribute, though
+ * only if the attribute has no equality strategy scan keys.
+ *
+ * Adding skip arrays to an attribute that has one or more
+ * inequality scan keys will cause preprocessing to output a range
+ * skip array. This will happen when preprocessing proper deals
+ * with the redundancy between the array and its inequalities.
+ */
+ skipatts[attno_skip - 1].eq_op = InvalidOid;
+ if (attno_has_equal)
+ {
+ /* don't skip, attribute already has an input equality key */
+ }
+ else if (_bt_skipsupport(rel, attno_skip, &skipatts[attno_skip - 1]))
+ {
+ /*
+ * saw no equalities for the prior attribute, so add a range
+ * skip array for this attribute
+ */
+ numSkipArrayKeys++;
+ }
+ else
+ {
+ /*
+ * Cannot generate a skip array for this or later attributes
+ * (input opclass lacks an equality strategy operator)
+ */
+ return numSkipArrayKeys;
+ }
+
+ /* Set things up for this new attribute */
+ attno_skip++;
+ attno_inputsk = inputsk->sk_attno;
+ attno_has_equal = false;
+ }
+
+ /*
+ * Track if this attribute's scan keys have any equality strategy scan
+ * keys (while counting IS NULL scan keys as equality scan keys)
+ */
+ if (inputsk->sk_strategy == BTEqualStrategyNumber ||
+ (inputsk->sk_flags & SK_SEARCHNULL))
+ attno_has_equal = true;
+
+ /*
+ * We don't support RowCompare transformation. Remember that we saw a
+ * RowCompare, so that we don't keep adding skip attributes. (We may
+ * still backfill skip attributes before the RowCompare, so that it
+ * will be marked required.)
+ */
+ if (inputsk->sk_flags & SK_ROW_HEADER)
+ attno_has_rowcompare = true;
+ }
+
+ return numSkipArrayKeys;
+}
+
+/*
+ * _bt_skipsupport() -- set up skip support function in *skipatts
+ *
+ * Returns true on success, indicating that we set *skipatts with input
+ * opclass's equality operator. Otherwise returns false.
+ */
+static bool
+_bt_skipsupport(Relation rel, int add_skip_attno, BTSkipPreproc *skipatts)
+{
+ int16 *indoption = rel->rd_indoption;
+ Oid opfamily = rel->rd_opfamily[add_skip_attno - 1];
+ Oid opcintype = rel->rd_opcintype[add_skip_attno - 1];
+ bool reverse;
+
+ /* Look up input opclass's equality operator (might fail) */
+ skipatts->eq_op = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ /*
+ * We don't really expect input opclasses lacking even an equality
+ * operator, but they're still supported. Deal with them gracefully.
+ */
+ if (!OidIsValid(skipatts->eq_op))
+ return false;
+
+ /* Have skip support infrastructure set all SkipSupport fields */
+ reverse = (indoption[add_skip_attno - 1] & INDOPTION_DESC) != 0;
+ skipatts->use_sksup = PrepareSkipSupportFromOpclass(opfamily, opcintype,
+ reverse,
+ &skipatts->sksup);
+
+ /* might not have set up skip support routine, but can skip either way */
+ return true;
+}
+
/*
* _bt_setup_array_cmp() -- Set up array comparison functions
*
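For readers following along, the decision made by _bt_decide_skipatts can be sketched in isolation. This is a toy model rather than code from the patch: ToyKey and toy_decide_skipatts are made-up names, and the sketch assumes that every opclass has an equality operator. It also ignores RowCompare keys and the temporary testing GUCs.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_KEYS 32

/* Hypothetical, simplified stand-in for one input scan key */
typedef struct ToyKey
{
	int			attno;			/* 1-based index attribute number */
	bool		is_equality;	/* "=" or IS NULL, as opposed to an inequality */
} ToyKey;

/*
 * Sets skip[attno - 1] for every attribute up to the last input key's
 * attribute, and returns how many attributes would get a skip array.
 *
 * keys[] must be sorted by attno, and skip[] must have room for
 * TOY_MAX_KEYS entries.
 */
static int
toy_decide_skipatts(const ToyKey *keys, int nkeys, bool *skip)
{
	int			nskip = 0;
	int			last_attno;

	assert(nkeys > 0);
	last_attno = keys[nkeys - 1].attno;
	assert(last_attno >= 1 && last_attno <= TOY_MAX_KEYS);

	/* Consider every attribute before the last input key's attribute */
	for (int attno = 1; attno < last_attno; attno++)
	{
		bool		has_equal = false;

		for (int i = 0; i < nkeys; i++)
		{
			if (keys[i].attno == attno && keys[i].is_equality)
				has_equal = true;
		}

		/* No equality key (or no key at all): plan on a skip array */
		skip[attno - 1] = !has_equal;
		if (!has_equal)
			nskip++;
	}

	/* The last input key's attribute never gets a skip array */
	skip[last_attno - 1] = false;

	return nskip;
}
```

With an index on (a, b, c) and the qual "b >= 3 AND c = 5", the sketch plans two skip arrays: a plain one on a (wholly omitted) and one on b, which the real preprocessing would later turn into a range skip array. Nothing is added for c.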
@@ -987,17 +1324,15 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
FmgrInfo *orderproc, BTArrayKeyInfo *array,
bool *qual_ok)
{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
Oid opcintype = rel->rd_opcintype[arraysk->sk_attno - 1];
- int cmpresult = 0,
- cmpexact = 0,
- matchelem,
- new_nelems = 0;
FmgrInfo crosstypeproc;
FmgrInfo *orderprocp = orderproc;
+ MemoryContext oldContext;
+ bool eliminated;
Assert(arraysk->sk_attno == skey->sk_attno);
- Assert(array->num_elems > 0);
Assert(!(arraysk->sk_flags & (SK_ISNULL | SK_ROW_HEADER | SK_ROW_MEMBER)));
Assert((arraysk->sk_flags & SK_SEARCHARRAY) &&
arraysk->sk_strategy == BTEqualStrategyNumber);
@@ -1010,8 +1345,8 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
* datum of opclass input type for the index's attribute (on-disk type).
* We can reuse the array's ORDER proc whenever the non-array scan key's
* type is a match for the corresponding attribute's input opclass type.
- * Otherwise, we have to do another ORDER proc lookup so that our call to
- * _bt_binsrch_array_skey applies the correct comparator.
+ * Otherwise, we have to do another ORDER proc lookup. We have to be sure
+ * that _bt_compare_array_skey/_bt_binsrch_array_skey use the right proc.
*
* Note: we have to support the convention that sk_subtype == InvalidOid
* means the opclass input type; this is a hack to simplify life for
@@ -1042,11 +1377,65 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
return false;
}
- /* We have all we need to determine redundancy/contradictoriness */
+ /* We successfully looked up the required cross-type ORDER proc */
orderprocp = &crosstypeproc;
fmgr_info(cmp_proc, orderprocp);
}
+ oldContext = MemoryContextSwitchTo(so->arrayContext);
+
+ /*
+ * Perform preprocessing of the array based on whether it's a conventional
+ * array, or a skip array. Sets *qual_ok correctly in passing.
+ */
+ if (array->num_elems != -1)
+ {
+ _bt_array_preproc_shrink(arraysk, skey, orderprocp, array, qual_ok);
+
+ /*
+ * We successfully looked up the required cross-type ORDER proc, which
+ * ensured that the scalar scan key could be eliminated as redundant
+ */
+ eliminated = true;
+ }
+ else
+ {
+ /*
+ * With a skip array it's possible that we won't be able to eliminate
+ * the scalar scan key, despite looking up the required ORDER proc.
+ * This happens when earlier preprocessing wasn't able to eliminate a
+ * redundant scan key inequality due to a lack of cross-type support.
+ */
+ eliminated = _bt_skip_preproc_shrink(scan, arraysk, skey, orderprocp,
+ array, qual_ok);
+ }
+
+ MemoryContextSwitchTo(oldContext);
+
+ return eliminated;
+}
+
+/*
+ * Finish off preprocessing of conventional (non-skip) array scan key when it
+ * is redundant with (or contradicted by) a non-array scalar scan key.
+ * _bt_compare_array_scankey_args helper function, called after the relevant
+ * (potentially cross-type) ORDER proc has been looked up successfully.
+ *
+ * Rewrites caller's array in-place as needed to eliminate redundant array
+ * elements. Calling here always renders caller's scalar scan key redundant.
+ */
+static void
+_bt_array_preproc_shrink(ScanKey arraysk, ScanKey skey, FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok)
+{
+ int cmpresult = 0,
+ cmpexact = 0,
+ matchelem,
+ new_nelems = 0;
+
+ Assert(array->num_elems > 0);
+ Assert(!(arraysk->sk_flags & SK_BT_SKIP));
+
matchelem = _bt_binsrch_array_skey(orderprocp, false,
NoMovementScanDirection,
skey->sk_argument, false, array,
@@ -1098,6 +1487,137 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
array->num_elems = new_nelems;
*qual_ok = new_nelems > 0;
+}
+
+/*
+ * Finish off preprocessing of skip array scan key when it is "redundant with"
+ * a non-array scalar scan key. The scalar scan key must be an inequality.
+ * _bt_compare_array_scankey_args helper function, called after the relevant
+ * (potentially cross-type) ORDER proc has been looked up successfully.
+ *
+ * Unlike _bt_array_preproc_shrink, we cannot really modify caller's array
+ * in-place. Skip arrays work by procedurally generating their elements as
+ * needed, so our approach is to store a copy of the inequality in the skip
+ * array, allowing its elements to be generated within the limits of a range.
+ * Calling here usually renders caller's scalar scan key redundant (the key is
+ * applied when the array advances, but that's just an implementation detail).
+ *
+ * Return value indicates whether caller's scalar scan key was made redundant
+ * (either rolled into the skip array, or discarded). We return true in
+ * the common case where caller's scan key could be successfully rolled into
+ * the skip array. We return false when we can't do that due to the presence
+ * of a conflicting inequality.
+ */
+static bool
+_bt_skip_preproc_shrink(IndexScanDesc scan, ScanKey arraysk, ScanKey skey,
+ FmgrInfo *orderprocp, BTArrayKeyInfo *array,
+ bool *qual_ok)
+{
+ bool test_result;
+
+ /*
+ * We don't expect to have to deal with NULLs in non-array/non-skip scan
+ * key. We expect _bt_preprocess_array_keys to avoid generating a skip
+ * array for an index attribute with an IS NULL input scan key. (It will
+ * still do so in the presence of IS NOT NULL input scan keys, but
+ * _bt_compare_scankey_args is expected to handle those for us.)
+ */
+ Assert(arraysk->sk_flags & SK_BT_SKIP);
+ Assert(arraysk->sk_flags & SK_SEARCHARRAY);
+ Assert(arraysk->sk_strategy == BTEqualStrategyNumber);
+ Assert(array->num_elems == -1);
+
+ /* Scalar scan key must be a B-Tree inequality (these are always strict) */
+ Assert(!(skey->sk_flags & SK_ISNULL));
+ Assert(skey->sk_strategy != BTEqualStrategyNumber);
+
+ /*
+ * Array must not generate a NULL array element (for "IS NULL" qual). Its
+ * index attribute is constrained by a strict operator, so NULL elements
+ * must not be returned by the scan (it would be wrong to allow it).
+ */
+ array->null_elem = false;
+ *qual_ok = true;
+
+ /*
+ * Store a copy of caller's scalar scan key, plus a copy of the operator's
+ * corresponding 3-way ORDER proc.
+ *
+ * A skip array scan key always uses the underlying index attribute's
+ * input opclass, but it's possible that caller's scalar scan key uses a
+ * cross-type operator. In cross-type scenarios, skey.sk_argument doesn't
+ * use the same type as later array elements (which are all just copies of
+ * datums taken from index tuples, possibly modified by skip support).
+ *
+ * We represent the lowest (and highest) possible value in the array using
+ * the sentinel value -inf (+inf for high_compare). The only exceptions
+ * apply when the opclass has skip support: there we can use a copy of the
+ * skip support routine's low_elem/high_elem instead -- though only when
+ * there is no corresponding low_compare/high_compare inequality.
+ *
+ * _bt_first understands that -inf/+inf indicate that it should use the
+ * low_compare/high_compare inequality for initial positioning purposes
+ * when it sees either value (unless there is no corresponding inequality,
+ * in which case the values are literally interpreted as -inf or +inf).
+ * _bt_first can therefore vary in whether it uses a cross-type operator,
+ * or an input-opclass-only operator (it can vary across primitive scans
+ * for the same index attribute/skip array).
+ *
+ * _bt_scankey_decrement/_bt_scankey_increment both make sure that each
+ * newly generated element is constrained by low_compare/high_compare.
+ * This must happen without skey.sk_argument ever being treated as a true
+ * array element (that wouldn't always work because array elements are
+ * only ever supposed to use the opclass input type).
+ */
+ switch (skey->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+ case BTLessEqualStrategyNumber:
+ if (array->high_compare)
+ {
+ /* try to keep only one high_compare inequality */
+ if (!_bt_compare_scankey_args(scan, array->high_compare, skey,
+ array->high_compare, NULL, NULL,
+ &test_result))
+ return false; /* can't make new high_compare redundant */
+
+ if (!test_result)
+ return true; /* discard new high_compare */
+
+ /* replace old high_compare with new one */
+ }
+ else
+ array->high_compare = palloc(sizeof(ScanKeyData));
+
+ memcpy(array->high_compare, skey, sizeof(ScanKeyData));
+ array->order_high = *orderprocp;
+ break;
+ case BTGreaterEqualStrategyNumber:
+ case BTGreaterStrategyNumber:
+ if (array->low_compare)
+ {
+ /* try to keep only one low_compare inequality */
+ if (!_bt_compare_scankey_args(scan, array->low_compare, skey,
+ array->low_compare, NULL, NULL,
+ &test_result))
+ return false; /* can't make new low_compare redundant */
+
+ if (!test_result)
+ return true; /* discard new low_compare */
+
+ /* replace old low_compare with new one */
+ }
+ else
+ array->low_compare = palloc(sizeof(ScanKeyData));
+
+ memcpy(array->low_compare, skey, sizeof(ScanKeyData));
+ array->order_low = *orderprocp;
+ break;
+ default:
+ elog(ERROR, "unrecognized StrategyNumber: %d",
+ (int) skey->sk_strategy);
+ break;
+ }
return true;
}
@@ -1140,7 +1660,8 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
static inline int32
_bt_compare_array_skey(FmgrInfo *orderproc,
Datum tupdatum, bool tupnull,
- Datum arrdatum, ScanKey cur)
+ Datum arrdatum, bool arrnull,
+ ScanKey cur)
{
int32 result = 0;
@@ -1148,14 +1669,14 @@ _bt_compare_array_skey(FmgrInfo *orderproc,
if (tupnull) /* NULL tupdatum */
{
- if (cur->sk_flags & SK_ISNULL)
+ if (arrnull)
result = 0; /* NULL "=" NULL */
else if (cur->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (cur->sk_flags & SK_ISNULL) /* NOT_NULL tupdatum, NULL arrdatum */
+ else if (arrnull) /* NOT_NULL tupdatum, NULL arrdatum */
{
if (cur->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
@@ -1221,6 +1742,8 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
Datum arrdatum;
Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(!(cur->sk_flags & SK_BT_SKIP));
+ Assert(!(cur->sk_flags & SK_ISNULL)); /* plain arrays can't do this */
Assert(cur->sk_strategy == BTEqualStrategyNumber);
if (cur_elem_trig)
@@ -1256,7 +1779,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
{
arrdatum = array->elem_values[low_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result <= 0)
{
@@ -1284,7 +1807,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
{
arrdatum = array->elem_values[high_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result >= 0)
{
@@ -1311,7 +1834,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
arrdatum = array->elem_values[mid_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result == 0)
{
@@ -1336,13 +1859,102 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
*/
if (low_elem != mid_elem)
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- array->elem_values[low_elem], cur);
+ array->elem_values[low_elem], false,
+ cur);
*set_elem_result = result;
return low_elem;
}
+/*
+ * _bt_binsrch_skiparray_skey() -- "Binary search" within a skip array
+ *
+ * This routine doesn't return an index into the array, because the array
+ * doesn't actually have any elements (it generates its array elements
+ * procedurally instead). Note that this may include a NULL value/an IS NULL
+ * qual.
+ *
+ * Sets *set_elem_result just like _bt_binsrch_array_skey would with a true
+ * array. The value 0 indicates that tupdatum/tupnull is within the range of
+ * the skip array. Other values indicate what _bt_compare_array_skey returned
+ * for the best available match to tupdatum/tupnull (in practice this means
+ * either the lowest item or the highest item in the range of the array).
+ *
+ * cur_elem_trig indicates if array advancement was triggered by this array's
+ * scan key. We use this to optimize-away comparisons that are known by our
+ * caller to be unnecessary from context, just like _bt_binsrch_array_skey.
+ */
+static void
+_bt_binsrch_skiparray_skey(FmgrInfo *orderproc,
+ bool cur_elem_trig, ScanDirection dir,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *set_elem_result)
+{
+ Assert(cur->sk_flags & SK_BT_SKIP);
+ Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(cur->sk_flags & SK_BT_REQFWD);
+ Assert(array->num_elems == -1);
+ Assert(!ScanDirectionIsNoMovement(dir));
+
+ if (tupnull) /* NULL tupdatum */
+ {
+ if (array->null_elem)
+ *set_elem_result = 0; /* NULL "=" NULL */
+ else if (cur->sk_flags & SK_BT_NULLS_FIRST)
+ *set_elem_result = -1; /* NULL "<" NOT_NULL */
+ else
+ *set_elem_result = 1; /* NULL ">" NOT_NULL */
+
+ return;
+ }
+
+ /*
+ * Array inequalities determine whether tupdatum is within the range of
+ * caller's skip array
+ */
+ *set_elem_result = 0;
+ if (ScanDirectionIsForward(dir))
+ {
+ /*
+ * Evaluate low_compare first (unless cur_elem_trig tells us that it
+ * cannot possibly fail to be satisfied), then evaluate high_compare
+ */
+ if (!cur_elem_trig && array->low_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->low_compare->sk_func,
+ array->low_compare->sk_collation,
+ tupdatum,
+ array->low_compare->sk_argument)))
+ *set_elem_result = -1;
+ else if (array->high_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->high_compare->sk_func,
+ array->high_compare->sk_collation,
+ tupdatum,
+ array->high_compare->sk_argument)))
+ *set_elem_result = 1;
+ }
+ else
+ {
+ /*
+ * Evaluate high_compare first (unless cur_elem_trig tells us that it
+ * cannot possibly fail to be satisfied), then evaluate low_compare
+ */
+ if (!cur_elem_trig && array->high_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->high_compare->sk_func,
+ array->high_compare->sk_collation,
+ tupdatum,
+ array->high_compare->sk_argument)))
+ *set_elem_result = 1;
+ else if (array->low_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->low_compare->sk_func,
+ array->low_compare->sk_collation,
+ tupdatum,
+ array->low_compare->sk_argument)))
+ *set_elem_result = -1;
+ }
+}
+
/*
* _bt_start_array_keys() -- Initialize array keys at start of a scan
*
@@ -1352,29 +1964,496 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
void
_bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- int i;
Assert(so->numArrayKeys);
Assert(so->qual_ok);
- for (i = 0; i < so->numArrayKeys; i++)
+ for (int i = 0; i < so->numArrayKeys; i++)
{
BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
ScanKey skey = &so->keyData[curArrayKey->scan_key];
- Assert(curArrayKey->num_elems > 0);
Assert(skey->sk_flags & SK_SEARCHARRAY);
- if (ScanDirectionIsBackward(dir))
- curArrayKey->cur_elem = curArrayKey->num_elems - 1;
- else
- curArrayKey->cur_elem = 0;
- skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
+ _bt_scankey_set_low_or_high(rel, skey, curArrayKey,
+ ScanDirectionIsForward(dir));
}
so->scanBehind = so->oppoDirCheck = false; /* reset */
}
+/*
+ * _bt_scankey_set_low_or_high() -- Set array scan key to lowest/highest element
+ *
+ * Caller also passes associated scan key, which will have its argument set to
+ * the lowest/highest array value in passing.
+ */
+static void
+_bt_scankey_set_low_or_high(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ bool low_not_high)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+
+ if (array->num_elems != -1)
+ {
+ /* set low or high element for conventional array */
+ int set_elem = 0;
+
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+
+ if (!low_not_high)
+ set_elem = array->num_elems - 1;
+
+ /*
+ * Just copy over array datum (only skip arrays require freeing and
+ * allocating memory for sk_argument)
+ */
+ array->cur_elem = set_elem;
+ skey->sk_argument = array->elem_values[set_elem];
+
+ return;
+ }
+
+ /* set low or high element for skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(array->num_elems == -1);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Clear possibly-irrelevant flags */
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEGPOSINF | SK_BT_NEXTPRIOR);
+
+ if (array->null_elem &&
+ (low_not_high == ((skey->sk_flags & SK_BT_NULLS_FIRST) != 0)))
+ {
+ /* Lowest (or highest) element is NULL, so set scan key to NULL */
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+ }
+ else if (low_not_high)
+ {
+ /* Lowest array element isn't NULL */
+ if (array->use_sksup && !array->low_compare)
+ skey->sk_argument = datumCopy(array->sksup.low_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_flags |= SK_BT_NEGPOSINF;
+ }
+ else
+ {
+ /* Highest array element isn't NULL */
+ if (array->use_sksup && !array->high_compare)
+ skey->sk_argument = datumCopy(array->sksup.high_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_flags |= SK_BT_NEGPOSINF;
+ }
+}
+
+/*
+ * _bt_scankey_set_element() -- Set skip array scan key's sk_argument
+ *
+ * Sets scan key to "IS NULL" when required, and handles memory management for
+ * pass-by-reference types.
+ */
+static void
+_bt_scankey_set_element(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ Datum tupdatum, bool tupnull)
+{
+ /* tupdatum within the range of low_compare/high_compare */
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(tupnull && !array->null_elem));
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEGPOSINF | SK_BT_NEXTPRIOR);
+ if (!tupnull)
+ skey->sk_argument = datumCopy(tupdatum, attr->attbyval, attr->attlen);
+ else
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+}
+
+/*
+ * _bt_scankey_unset_isnull() -- increment/decrement scan key from NULL
+ *
+ * Unsets scan key's "IS NULL" marking, and sets the non-NULL value from the
+ * array immediately before (or immediately after) NULL in the key space.
+ */
+static void
+_bt_scankey_unset_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(skey->sk_flags & SK_SEARCHNULL);
+ Assert(skey->sk_flags & SK_ISNULL);
+ Assert(!(skey->sk_flags & (SK_BT_NEGPOSINF | SK_BT_NEXTPRIOR)));
+ Assert(skey->sk_argument == 0);
+ Assert(array->use_sksup && array->null_elem &&
+ !array->low_compare && !array->high_compare);
+
+ /*
+ * sk_argument must be set to whatever non-NULL value comes immediately
+ * before or after NULL
+ */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL);
+ if (skey->sk_flags & SK_BT_NULLS_FIRST)
+ skey->sk_argument = datumCopy(array->sksup.low_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_argument = datumCopy(array->sksup.high_elem,
+ attr->attbyval, attr->attlen);
+}
+
+/*
+ * _bt_scankey_set_isnull() -- decrement/increment scan key to NULL
+ */
+static void
+_bt_scankey_set_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & (SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEGPOSINF | SK_BT_NEXTPRIOR)));
+ Assert(array->null_elem);
+ Assert(!array->low_compare && !array->high_compare);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Set sk_argument to NULL */
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+}
+
+/*
+ * _bt_scankey_decrement() -- decrement array scan key's sk_argument
+ *
+ * Return value indicates whether caller's array was successfully decremented.
+ * Cannot decrement an array whose current element is already the first one.
+ */
+static bool
+_bt_scankey_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ bool underflow = false;
+ Datum dec_sk_argument;
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & SK_BT_NEXTPRIOR));
+
+ /* Regular (non-skip) array? */
+ if (array->num_elems != -1)
+ {
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+ if (array->cur_elem > 0)
+ {
+ /*
+ * Just copy over array datum (only skip arrays require freeing
+ * and allocating memory for sk_argument)
+ */
+ array->cur_elem--;
+ skey->sk_argument = array->elem_values[array->cur_elem];
+
+ /* Successfully decremented array */
+ return true;
+ }
+
+ /* Cannot decrement to before first array element */
+ return false;
+ }
+
+ /* Nope, this is a skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+
+ /* The sentinel value -inf is never decrementable */
+ if (skey->sk_flags & SK_BT_NEGPOSINF)
+ return false;
+
+ /*
+ * When the current array element is NULL, and the lowest sorting value in
+ * the index is also NULL, we cannot decrement before first array element
+ */
+ if ((skey->sk_flags & SK_ISNULL) && (skey->sk_flags & SK_BT_NULLS_FIRST))
+ return false;
+
+ /*
+ * Opclasses without skip support "decrement" the scan key's current
+ * element by setting the NEXTPRIOR flag. The true prior value can only
+ * be determined when the scan reads lower sorting tuples.
+ *
+ * When the current array element is NULL, and the highest sorting value
+ * in the index is also NULL, _bt_first can find the highest non-NULL.
+ */
+ if (!array->use_sksup)
+ {
+ /*
+ * Determine as best we can (given the lack of skip support) whether
+ * the prior element will turn out to be out of bounds for the skip
+ * array.
+ *
+ * Skip arrays (that lack skip support) can only do this when their
+ * low_compare is for an >= inequality; if the current array element
+ * is == the inequality's sk_argument, then the true prior value
+ * cannot possibly satisfy low_compare. We can give up right away.
+ */
+ if (array->low_compare &&
+ array->low_compare->sk_strategy == BTGreaterEqualStrategyNumber &&
+ _bt_compare_array_skey(&array->order_low,
+ array->low_compare->sk_argument, false,
+ skey->sk_argument, false,
+ skey) == 0)
+ return false;
+
+ /* else the scan must figure out the true prior value */
+ skey->sk_flags |= SK_BT_NEXTPRIOR;
+ return true;
+ }
+
+ /*
+ * Opclasses with skip support decrement the scan key's current element
+ * using a callback
+ */
+ if (skey->sk_flags & SK_ISNULL)
+ {
+ Assert(!(skey->sk_flags & SK_BT_NULLS_FIRST));
+
+ /*
+ * Existing sk_argument/array element is NULL (for an IS NULL qual).
+ *
+ * Decrement current array element to the high_elem value provided by
+ * opclass skip support routine.
+ */
+ _bt_scankey_unset_isnull(rel, skey, array);
+ return true;
+ }
+
+ /*
+ * Ask opclass support routine to provide decremented copy of existing
+ * non-NULL sk_argument
+ */
+ dec_sk_argument = array->sksup.decrement(rel, skey->sk_argument, &underflow);
+
+ if (underflow)
+ {
+ if (array->null_elem && (skey->sk_flags & SK_BT_NULLS_FIRST))
+ {
+ /*
+ * Existing sk_argument was already equal to non-NULL low_elem
+ * provided by opclass skip support routine, but skip array's true
+ * lowest element is actually NULL.
+ *
+ * Decrement sk_argument to NULL.
+ */
+ _bt_scankey_set_isnull(rel, skey, array);
+ return true;
+ }
+
+ /* Cannot decrement before first array element */
+ return false;
+ }
+
+ /*
+ * Successfully decremented sk_argument to a non-NULL value. Make sure
+ * that the decremented value is still within the range of the skip array.
+ */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (array->low_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->low_compare->sk_func,
+ array->low_compare->sk_collation,
+ dec_sk_argument,
+ array->low_compare->sk_argument)))
+ {
+ /* Keep existing sk_argument after all */
+ if (!attr->attbyval)
+ pfree(DatumGetPointer(dec_sk_argument));
+
+ /* Cannot decrement before first array element */
+ return false;
+ }
+
+ /* Accept non-NULL datum value from opclass decrement callback */
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+ skey->sk_argument = dec_sk_argument;
+
+ return true;
+}
+
+/*
+ * _bt_scankey_increment() -- increment array scan key's sk_argument
+ *
+ * Return value indicates whether caller's array was successfully incremented.
+ * Cannot increment an array whose current element is already the final one.
+ */
+static bool
+_bt_scankey_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ bool overflow = false;
+ Datum inc_sk_argument;
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & SK_BT_NEXTPRIOR));
+
+ /* Regular (non-skip) array? */
+ if (array->num_elems != -1)
+ {
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+ if (array->cur_elem < array->num_elems - 1)
+ {
+ /*
+ * Just copy over array datum (only skip arrays require freeing
+ * and allocating memory for sk_argument)
+ */
+ array->cur_elem++;
+ skey->sk_argument = array->elem_values[array->cur_elem];
+
+ /* Successfully incremented array */
+ return true;
+ }
+
+ /* Cannot increment past final array element */
+ return false;
+ }
+
+ /* Nope, this is a skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+
+ /* The sentinel value +inf is never incrementable */
+ if (skey->sk_flags & SK_BT_NEGPOSINF)
+ return false;
+
+ /*
+ * When the current array element is NULL, and the highest sorting value
+ * in the index is also NULL, we cannot increment past the final element
+ */
+ if ((skey->sk_flags & SK_ISNULL) && !(skey->sk_flags & SK_BT_NULLS_FIRST))
+ return false;
+
+ /*
+ * Opclasses without skip support "increment" the scan key's current
+ * element by setting the NEXTPRIOR flag. The true next value can only be
+ * determined when the scan reads higher sorting tuples.
+ *
+ * When the current array element is NULL, and the lowest sorting value in
+ * the index is also NULL, _bt_first can find the lowest non-NULL.
+ */
+ if (!array->use_sksup)
+ {
+ /*
+ * Determine as best we can (given the lack of skip support) whether
+ * the next element will turn out to be out of bounds for the skip
+ * array.
+ *
+ * Skip arrays (that lack skip support) can only do this when their
+ * high_compare is for an <= inequality; if the current array element
+ * is == the inequality's sk_argument, then the true next value cannot
+ * possibly satisfy high_compare. We can give up right away.
+ */
+ if (array->high_compare &&
+ array->high_compare->sk_strategy == BTLessEqualStrategyNumber &&
+ _bt_compare_array_skey(&array->order_high,
+ array->high_compare->sk_argument, false,
+ skey->sk_argument, false,
+ skey) == 0)
+ return false;
+
+ /* else the scan must figure out the true next value */
+ skey->sk_flags |= SK_BT_NEXTPRIOR;
+ return true;
+ }
+
+ /*
+ * Opclasses with skip support increment the scan key's current element
+ * using a callback
+ */
+ if (skey->sk_flags & SK_ISNULL)
+ {
+ Assert(skey->sk_flags & SK_BT_NULLS_FIRST);
+
+ /*
+ * Existing sk_argument/array element is NULL (for an IS NULL qual).
+ *
+ * Increment current array element to the low_elem value provided by
+ * opclass skip support routine.
+ */
+ _bt_scankey_unset_isnull(rel, skey, array);
+ return true;
+ }
+
+ /*
+ * Ask opclass support routine to provide incremented copy of existing
+ * non-NULL sk_argument
+ */
+ inc_sk_argument = array->sksup.increment(rel, skey->sk_argument, &overflow);
+
+ if (overflow)
+ {
+ if (array->null_elem && !(skey->sk_flags & SK_BT_NULLS_FIRST))
+ {
+ /*
+ * Existing sk_argument was already equal to non-NULL high_elem
+ * provided by opclass skip support routine, but skip array's true
+ * highest element is actually NULL.
+ *
+ * Increment sk_argument to NULL.
+ */
+ _bt_scankey_set_isnull(rel, skey, array);
+ return true;
+ }
+
+ /* Cannot increment past final array element */
+ return false;
+ }
+
+ /*
+ * Successfully incremented sk_argument to a non-NULL value. Make sure
+ * that the incremented value is still within the range of the skip array.
+ */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (array->high_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->high_compare->sk_func,
+ array->high_compare->sk_collation,
+ inc_sk_argument,
+ array->high_compare->sk_argument)))
+ {
+ /* Keep existing sk_argument after all */
+ if (!attr->attbyval)
+ pfree(DatumGetPointer(inc_sk_argument));
+
+ /* Cannot increment past final array element */
+ return false;
+ }
+
+ /* Accept non-NULL datum value from opclass increment callback */
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+ skey->sk_argument = inc_sk_argument;
+
+ return true;
+}
+
/*
* _bt_advance_array_keys_increment() -- Advance to next set of array elements
*
@@ -1390,6 +2469,7 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
static bool
_bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
/*
@@ -1399,29 +2479,30 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
*/
for (int i = so->numArrayKeys - 1; i >= 0; i--)
{
- BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
- ScanKey skey = &so->keyData[curArrayKey->scan_key];
- int cur_elem = curArrayKey->cur_elem;
- int num_elems = curArrayKey->num_elems;
- bool rolled = false;
+ BTArrayKeyInfo *array = &so->arrayKeys[i];
+ ScanKey skey = &so->keyData[array->scan_key];
- if (ScanDirectionIsForward(dir) && ++cur_elem >= num_elems)
+ if (ScanDirectionIsForward(dir))
{
- cur_elem = 0;
- rolled = true;
+ if (_bt_scankey_increment(rel, skey, array))
+ return true;
}
- else if (ScanDirectionIsBackward(dir) && --cur_elem < 0)
+ else
{
- cur_elem = num_elems - 1;
- rolled = true;
+ if (_bt_scankey_decrement(rel, skey, array))
+ return true;
}
- curArrayKey->cur_elem = cur_elem;
- skey->sk_argument = curArrayKey->elem_values[cur_elem];
- if (!rolled)
- return true;
+ /*
+ * Couldn't increment (or decrement) array. Handle array roll over.
+ *
+ * Start over at the array's lowest sorting value (or its highest
+ * value, for backward scans)...
+ */
+ _bt_scankey_set_low_or_high(rel, skey, array,
+ ScanDirectionIsForward(dir));
- /* Need to advance next array key, if any */
+ /* ...then increment (or decrement) next most significant array */
}
/*
@@ -1476,6 +2557,7 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
static void
_bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
int arrayidx = 0;
@@ -1483,7 +2565,6 @@ _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
{
ScanKey cur = so->keyData + ikey;
BTArrayKeyInfo *array = NULL;
- int first_elem_dir;
if (!(cur->sk_flags & SK_SEARCHARRAY) ||
cur->sk_strategy != BTEqualStrategyNumber)
@@ -1495,16 +2576,10 @@ _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
if ((cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
continue;
- if (ScanDirectionIsForward(dir))
- first_elem_dir = 0;
- else
- first_elem_dir = array->num_elems - 1;
+ Assert(array->num_elems != -1); /* No non-required skip arrays */
- if (array->cur_elem != first_elem_dir)
- {
- array->cur_elem = first_elem_dir;
- cur->sk_argument = array->elem_values[first_elem_dir];
- }
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
}
}
@@ -1568,6 +2643,8 @@ _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
for (int ikey = sktrig; ikey < so->numberOfKeys; ikey++)
{
ScanKey cur = so->keyData + ikey;
+ Datum sk_argument = cur->sk_argument;
+ bool sk_isnull = (cur->sk_flags & SK_ISNULL) != 0;
Datum tupdatum;
bool tupnull;
int32 result;
@@ -1629,9 +2706,66 @@ _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
tupdatum = index_getattr(tuple, cur->sk_attno, tupdesc, &tupnull);
- result = _bt_compare_array_skey(&so->orderProcs[ikey],
- tupdatum, tupnull,
- cur->sk_argument, cur);
+ if (!(cur->sk_flags & SK_BT_NEGPOSINF))
+ {
+ /* The scankey has a conventional sk_argument/element value */
+ result = _bt_compare_array_skey(&so->orderProcs[ikey],
+ tupdatum, tupnull,
+ sk_argument, sk_isnull, cur);
+
+ /*
+ * When scan key is marked NEXTPRIOR, the current array element is
+ * "sk_argument + infinitesimal" (or the current array element is
+ * "sk_argument - infinitesimal", during backwards scans)
+ */
+ if (result == 0 && (cur->sk_flags & SK_BT_NEXTPRIOR))
+ {
+ /*
+ * tupdatum is actually still < "sk_argument + infinitesimal"
+ * (or it's actually still > "sk_argument - infinitesimal")
+ */
+ return true;
+ }
+ }
+ else
+ {
+ /*
+ * The scankey searches for the sentinel value -inf/+inf.
+ *
+ * Note: -inf could mean "absolute" -inf, or it could represent
+ * the lowest possible value that still satisfies the array's
+ * low_compare. +inf and high_compare work similarly.
+ */
+ BTArrayKeyInfo *array = NULL;
+
+ for (int arrayidx = 0; arrayidx < so->numArrayKeys; arrayidx++)
+ {
+ array = &so->arrayKeys[arrayidx];
+ if (array->scan_key == ikey)
+ break;
+ }
+
+ /*
+ * Compare tupdatum against -inf using array's low_compare, if any
+ * (or compare it against +inf using array's high_compare).
+ *
+ * Optimization: avoid uselessly evaluating array's high_compare
+ * (or uselessly evaluating array's low_compare) by passing
+ * cur_elem_trig=true, along with an inverted scan direction.
+ */
+ _bt_binsrch_skiparray_skey(&so->orderProcs[ikey], true, -dir,
+ tupdatum, tupnull, array, cur,
+ &result);
+
+ if (result == 0)
+ {
+ /*
+ * tupdatum is > -inf sk_argument (or < +inf sk_argument).
+ * It's time for caller to advance the scan's array keys.
+ */
+ return false;
+ }
+ }
/*
* Does this comparison indicate that caller must _not_ advance the
@@ -1963,18 +3097,9 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
if (beyond_end_advance)
{
- int final_elem_dir;
-
- if (ScanDirectionIsBackward(dir) || !array)
- final_elem_dir = 0;
- else
- final_elem_dir = array->num_elems - 1;
-
- if (array && array->cur_elem != final_elem_dir)
- {
- array->cur_elem = final_elem_dir;
- cur->sk_argument = array->elem_values[final_elem_dir];
- }
+ if (array)
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsBackward(dir));
continue;
}
@@ -1999,18 +3124,9 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
if (!all_required_satisfied || cur->sk_attno > tupnatts)
{
- int first_elem_dir;
-
- if (ScanDirectionIsForward(dir) || !array)
- first_elem_dir = 0;
- else
- first_elem_dir = array->num_elems - 1;
-
- if (array && array->cur_elem != first_elem_dir)
- {
- array->cur_elem = first_elem_dir;
- cur->sk_argument = array->elem_values[first_elem_dir];
- }
+ if (array)
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
continue;
}
@@ -2028,15 +3144,27 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
/*
* Binary search for closest match that's available from the array
*/
- set_elem = _bt_binsrch_array_skey(&so->orderProcs[ikey],
- cur_elem_trig, dir,
- tupdatum, tupnull, array, cur,
- &result);
+ if (array->num_elems != -1)
+ set_elem = _bt_binsrch_array_skey(&so->orderProcs[ikey],
+ cur_elem_trig, dir,
+ tupdatum, tupnull, array, cur,
+ &result);
- Assert(set_elem >= 0 && set_elem < array->num_elems);
+ /*
+ * "Binary search" by checking if tupdatum/tupnull are within the
+ * range of the skip array
+ */
+ else
+ _bt_binsrch_skiparray_skey(&so->orderProcs[ikey],
+ cur_elem_trig, dir,
+ tupdatum, tupnull, array, cur,
+ &result);
}
else
{
+ Datum sk_argument = cur->sk_argument;
+ bool sk_isnull = (cur->sk_flags & SK_ISNULL) != 0;
+
Assert(sktrig_required && required);
/*
@@ -2050,7 +3178,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
result = _bt_compare_array_skey(&so->orderProcs[ikey],
tupdatum, tupnull,
- cur->sk_argument, cur);
+ sk_argument, sk_isnull, cur);
}
/*
@@ -2109,11 +3237,65 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
}
}
- /* Advance array keys, even when set_elem isn't an exact match */
- if (array && array->cur_elem != set_elem)
+ if (!array)
+ continue; /* cannot advance a non-array */
+
+ /* Advance array keys, even when we don't have an exact match */
+ if (array->num_elems != -1)
{
- array->cur_elem = set_elem;
- cur->sk_argument = array->elem_values[set_elem];
+ /* Conventional array, use set_elem... */
+ if (array->cur_elem != set_elem)
+ {
+ array->cur_elem = set_elem;
+ cur->sk_argument = array->elem_values[set_elem];
+ }
+
+ continue;
+ }
+
+ /*
+ * ...or skip array, which doesn't advance using a set_elem offset.
+ *
+ * Array "contains" elements for every possible datum from a given
+ * range of values. This is often the range -inf through to +inf.
+ * "Binary searching" a skip array only determines whether tupdatum is
+ * beyond its range, before its range, or within its range.
+ *
+ * Note: conventional arrays cannot use this approach. They need
+ * "beyond end of array element" advancement to distinguish between
+ * the final array element (where incremental advancement rolls over
+ * to the next most significant array), and some earlier array element
+ * (where incremental advancement just increments set_elem/cur_elem).
+ * That distinction doesn't exist when dealing with range skip arrays.
+ */
+ if (beyond_end_advance)
+ {
+ /*
+ * tupdatum/tupnull is > the skip array's "final element"
+ * (tupdatum/tupnull is < the "first element" for backwards scans)
+ */
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsBackward(dir));
+ }
+ else if (!all_required_satisfied)
+ {
+ /*
+ * tupdatum/tupnull is < the skip array's "first element"
+ * (tupdatum/tupnull is > the "final element" for backwards scans)
+ */
+ Assert(sktrig < ikey); /* check on _bt_tuple_before_array_skeys */
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
+ }
+ else
+ {
+ /*
+ * tupdatum/tupnull is == some particular skip array element.
+ *
+ * Set scan key's sk_argument to tupdatum. If tupdatum is null,
+ * we'll set IS NULL flags in scan key's sk_flags instead.
+ */
+ _bt_scankey_set_element(rel, cur, array, tupdatum, tupnull);
}
}
@@ -2464,6 +3646,8 @@ end_toplevel_scan:
* within each attribute may be done as a byproduct of the processing here.
* That process must leave array scan keys (within an attribute) in the same
* order as corresponding entries from the scan's BTArrayKeyInfo array info.
+ * We might also cons up skip array scan keys that weren't present in the
+ * original input keys; these are also output in standard attribute order.
*
* The output keys are marked with flags SK_BT_REQFWD and/or SK_BT_REQBKWD
* if they must be satisfied in order to continue the scan forward or backward
@@ -2587,8 +3771,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
inputsk = scan->keyData;
/*
- * Now that we have an estimate of the number of output scan keys,
- * allocate space for them
+ * Now that we have an estimate of the number of output scan keys
+ * (including any skip array scan keys), allocate space for them
*/
so->keyData = palloc(sizeof(ScanKeyData) * numberOfKeys);
@@ -2724,7 +3908,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
return;
}
/* else discard the redundant non-equality key */
- Assert(!array || array->num_elems > 0);
+ Assert(!array || array->num_elems > 0 ||
+ array->num_elems == -1);
xform[j].skey = NULL;
xform[j].ikey = -1;
}
@@ -2887,7 +4072,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
/* Have all we need to determine redundancy */
if (test_result)
{
- Assert(!array || array->num_elems > 0);
+ Assert(!array || array->num_elems > 0 ||
+ array->num_elems == -1);
/*
* New key is more restrictive, and so replaces old key...
@@ -3029,10 +4215,11 @@ _bt_verify_keys_with_arraykeys(IndexScanDesc scan)
if (array->scan_key != ikey)
return false;
- if (array->num_elems <= 0)
+ if (array->num_elems == 0 || array->num_elems < -1)
return false;
- if (cur->sk_argument != array->elem_values[array->cur_elem])
+ if (array->num_elems != -1 &&
+ cur->sk_argument != array->elem_values[array->cur_elem])
return false;
if (last_sk_attno > cur->sk_attno)
return false;
@@ -3107,6 +4294,22 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool leftnull,
rightnull;
+ /* Handle skip array comparison with IS NOT NULL scan key */
+ if ((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP)
+ {
+ /* Shouldn't generate skip array in presence of IS NULL key */
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_SEARCHNULL));
+ Assert((leftarg->sk_flags | rightarg->sk_flags) & SK_SEARCHNOTNULL);
+
+ /* Skip array will have no NULL element/IS NULL scan key */
+ Assert(array->num_elems == -1);
+ array->null_elem = false;
+
+ /* IS NOT NULL key (could be leftarg or rightarg) now redundant */
+ *result = true;
+ return true;
+ }
+
if (leftarg->sk_flags & SK_ISNULL)
{
Assert(leftarg->sk_flags & (SK_SEARCHNULL | SK_SEARCHNOTNULL));
@@ -3180,6 +4383,7 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
{
/* Can't make the comparison */
*result = false; /* suppress compiler warnings */
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP));
return false;
}
@@ -3743,6 +4947,20 @@ _bt_check_compare(IndexScanDesc scan, ScanDirection dir,
continue;
}
+ /*
+ * A skip array scan key might be negative/positive infinity. It might
+ * also be a next key/prior key sentinel, which we don't deal with here.
+ */
+ if (key->sk_flags & (SK_BT_NEGPOSINF | SK_BT_NEXTPRIOR))
+ {
+ Assert(key->sk_flags & SK_SEARCHARRAY);
+ Assert(key->sk_flags & SK_BT_SKIP);
+ Assert(requiredSameDir);
+
+ *continuescan = false;
+ return false;
+ }
+
/* row-comparison keys need special processing */
if (key->sk_flags & SK_ROW_HEADER)
{
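Reviewer's sketch (not part of the patch): the roll over handling that the
rewritten _bt_advance_array_keys_increment() now delegates to
_bt_scankey_increment()/_bt_scankey_decrement() behaves much like an
odometer. The standalone C model below is only an illustration under
simplifying assumptions. ToyArray, toy_increment, toy_decrement and
toy_advance are hypothetical names, every array is a plain integer range,
and NULL elements, the -inf/+inf sentinels and NEXTPRIOR are all ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for BTArrayKeyInfo: an array over an int range */
typedef struct ToyArray
{
    int low;   /* lowest element in the array's range */
    int high;  /* highest element in the array's range */
    int cur;   /* current element */
} ToyArray;

/* Loosely modeled on _bt_scankey_increment(): false means "at final element" */
static bool
toy_increment(ToyArray *array)
{
    if (array->cur >= array->high)
        return false;
    array->cur++;
    return true;
}

/* Loosely modeled on _bt_scankey_decrement(): false means "at first element" */
static bool
toy_decrement(ToyArray *array)
{
    if (array->cur <= array->low)
        return false;
    array->cur--;
    return true;
}

/*
 * Loosely modeled on _bt_advance_array_keys_increment(): try the least
 * significant array first.  When it cannot advance, roll it over to its
 * first element in the scan direction and carry into the next most
 * significant array.  Returns false once every array has rolled over.
 */
static bool
toy_advance(ToyArray *arrays, size_t narrays, bool forward)
{
    for (size_t i = narrays; i-- > 0;)
    {
        ToyArray *array = &arrays[i];

        if (forward ? toy_increment(array) : toy_decrement(array))
            return true;

        /* Roll over: restart at lowest value (highest when scanning backward) */
        array->cur = forward ? array->low : array->high;
    }

    return false;
}
```

With two arrays covering 1..2 and 10..11 that start at (1, 10), forward
calls move to (1, 11), (2, 10) and (2, 11). The next call returns false and
leaves both arrays reset to their lowest values.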
diff --git a/src/backend/access/nbtree/nbtvalidate.c b/src/backend/access/nbtree/nbtvalidate.c
index e9d4cd60d..96d0d9185 100644
--- a/src/backend/access/nbtree/nbtvalidate.c
+++ b/src/backend/access/nbtree/nbtvalidate.c
@@ -114,6 +114,10 @@ btvalidate(Oid opclassoid)
case BTOPTIONS_PROC:
ok = check_amoptsproc_signature(procform->amproc);
break;
+ case BTSKIPSUPPORT_PROC:
+ ok = check_amproc_signature(procform->amproc, VOIDOID, true,
+ 1, 1, INTERNALOID);
+ break;
default:
ereport(INFO,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
diff --git a/src/backend/commands/opclasscmds.c b/src/backend/commands/opclasscmds.c
index b8b5c147c..a86dbf71b 100644
--- a/src/backend/commands/opclasscmds.c
+++ b/src/backend/commands/opclasscmds.c
@@ -1330,6 +1330,31 @@ assignProcTypes(OpFamilyMember *member, Oid amoid, Oid typeoid,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
errmsg("btree equal image functions must not be cross-type")));
}
+ else if (member->number == BTSKIPSUPPORT_PROC)
+ {
+ if (procform->pronargs != 1 ||
+ procform->proargtypes.values[0] != INTERNALOID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must accept type \"internal\"")));
+ if (procform->prorettype != VOIDOID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must return void")));
+
+ /*
+ * pg_amproc functions are indexed by (lefttype, righttype), but a
+ * skip support function only ever operates on values of a single
+ * type: the opclass's opcintype OID is always used for both
+ * lefttype and righttype. A cross-type routine therefore makes no
+ * sense. Reject cross-type ALTER OPERATOR FAMILY ... ADD
+ * FUNCTION 6 statements here.
+ */
+ if (member->lefttype != member->righttype)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must not be cross-type")));
+ }
}
else if (amoid == HASH_AM_OID)
{
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index db6ed784a..86a5de8f4 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -143,6 +143,7 @@ static const char *const BuiltinTrancheNames[] = {
[LWTRANCHE_LOCK_MANAGER] = "LockManager",
[LWTRANCHE_PREDICATE_LOCK_MANAGER] = "PredicateLockManager",
[LWTRANCHE_PARALLEL_HASH_JOIN] = "ParallelHashJoin",
+ [LWTRANCHE_PARALLEL_BTREE_SCAN] = "ParallelBtreeScan",
[LWTRANCHE_PARALLEL_QUERY_DSA] = "ParallelQueryDSA",
[LWTRANCHE_PER_SESSION_DSA] = "PerSessionDSA",
[LWTRANCHE_PER_SESSION_RECORD_TYPE] = "PerSessionRecordType",
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8efb4044d..6efa3e353 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -372,6 +372,7 @@ BufferMapping "Waiting to associate a data block with a buffer in the buffer poo
LockManager "Waiting to read or update information about <quote>heavyweight</quote> locks."
PredicateLockManager "Waiting to access predicate lock information used by serializable transactions."
ParallelHashJoin "Waiting to synchronize workers during Parallel Hash Join plan execution."
+ParallelBtreeScan "Waiting to synchronize workers during a parallel B-tree scan."
ParallelQueryDSA "Waiting for parallel query dynamic shared memory allocation."
PerSessionDSA "Waiting for parallel query dynamic shared memory allocation."
PerSessionRecordType "Waiting to access a parallel query's information about composite types."
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index edb09d4e3..e945686c8 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -96,6 +96,7 @@ OBJS = \
rowtypes.o \
ruleutils.o \
selfuncs.o \
+ skipsupport.o \
tid.o \
timestamp.o \
trigfuncs.o \
diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c
index 9c854e0e5..79658f068 100644
--- a/src/backend/utils/adt/date.c
+++ b/src/backend/utils/adt/date.c
@@ -34,6 +34,7 @@
#include "utils/date.h"
#include "utils/datetime.h"
#include "utils/numeric.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
/*
@@ -455,6 +456,49 @@ date_sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+date_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ DateADT dexisting = DatumGetDateADT(existing);
+
+ if (dexisting == DATEVAL_NOBEGIN)
+ {
+ *underflow = true;
+ return (Datum) 0; /* caller ignores the value on underflow */
+ }
+
+ *underflow = false;
+ return DateADTGetDatum(dexisting - 1);
+}
+
+static Datum
+date_increment(Relation rel, Datum existing, bool *overflow)
+{
+ DateADT dexisting = DatumGetDateADT(existing);
+
+ if (dexisting == DATEVAL_NOEND)
+ {
+ *overflow = true;
+ return (Datum) 0; /* caller ignores the value on overflow */
+ }
+
+ *overflow = false;
+ return DateADTGetDatum(dexisting + 1);
+}
+
+Datum
+date_skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = date_decrement;
+ sksup->increment = date_increment;
+ sksup->low_elem = DateADTGetDatum(DATEVAL_NOBEGIN);
+ sksup->high_elem = DateADTGetDatum(DATEVAL_NOEND);
+
+ PG_RETURN_VOID();
+}
+
Datum
date_finite(PG_FUNCTION_ARGS)
{
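Reviewer's sketch (not part of the patch): the contract between an opclass
skip support routine such as date_skipsupport() and the skip-support path
of _bt_scankey_increment() can be modeled in standalone C. This is a
simplified illustration, not the patch's code. ToySkipSupport, ToyKey,
toy_int_skipsupport and toy_key_increment are hypothetical names, plain int
stands in for Datum, and low_compare/high_compare bounds and memory
management are left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the SkipSupport struct filled in by the opclass */
typedef struct ToySkipSupport
{
    int (*decrement) (int existing, bool *underflow);
    int (*increment) (int existing, bool *overflow);
    int low_elem;   /* lowest non-NULL value of the type */
    int high_elem;  /* highest non-NULL value of the type */
} ToySkipSupport;

static int
toy_int_decrement(int existing, bool *underflow)
{
    if (existing == INT_MIN)
    {
        *underflow = true;
        return 0;           /* value is ignored by the caller */
    }
    *underflow = false;
    return existing - 1;
}

static int
toy_int_increment(int existing, bool *overflow)
{
    if (existing == INT_MAX)
    {
        *overflow = true;
        return 0;           /* value is ignored by the caller */
    }
    *overflow = false;
    return existing + 1;
}

/* Plays the role that date_skipsupport() plays for the date opclass */
static void
toy_int_skipsupport(ToySkipSupport *sksup)
{
    sksup->decrement = toy_int_decrement;
    sksup->increment = toy_int_increment;
    sksup->low_elem = INT_MIN;
    sksup->high_elem = INT_MAX;
}

/* Hypothetical stand-in for a skip array scan key's current element */
typedef struct ToyKey
{
    int  value;
    bool isnull;
} ToyKey;

/*
 * Loosely modeled on _bt_scankey_increment() for an opclass with skip
 * support.  null_elem says whether the array has a NULL element at all;
 * nulls_first says whether NULL sorts before every non-NULL value.
 */
static bool
toy_key_increment(const ToySkipSupport *sksup, ToyKey *key,
                  bool null_elem, bool nulls_first)
{
    bool overflow = false;
    int  next;

    if (key->isnull)
    {
        if (!nulls_first)
            return false;   /* NULL is already the final element */
        key->isnull = false;
        key->value = sksup->low_elem;   /* step from NULL to low_elem */
        return true;
    }

    next = sksup->increment(key->value, &overflow);
    if (overflow)
    {
        if (null_elem && !nulls_first)
        {
            key->isnull = true;         /* step from high_elem to NULL */
            key->value = 0;
            return true;
        }
        return false;       /* cannot increment past final element */
    }

    key->value = next;
    return true;
}
```

In this model, with NULLS LAST ordering and a NULL element present,
incrementing from INT_MAX yields NULL, and a further increment fails. With
NULLS FIRST ordering, incrementing from NULL yields low_elem.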
diff --git a/src/backend/utils/adt/meson.build b/src/backend/utils/adt/meson.build
index 8c6fc80c3..91682edd5 100644
--- a/src/backend/utils/adt/meson.build
+++ b/src/backend/utils/adt/meson.build
@@ -83,6 +83,7 @@ backend_sources += files(
'rowtypes.c',
'ruleutils.c',
'selfuncs.c',
+ 'skipsupport.c',
'tid.c',
'timestamp.c',
'trigfuncs.c',
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index bf42393be..4c1841a2b 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -192,6 +192,8 @@ static double convert_timevalue_to_scalar(Datum value, Oid typid,
bool *failure);
static void examine_simple_variable(PlannerInfo *root, Var *var,
VariableStatData *vardata);
+static void examine_indexcol_variable(PlannerInfo *root, IndexOptInfo *index,
+ int indexcol, VariableStatData *vardata);
static bool get_variable_range(PlannerInfo *root, VariableStatData *vardata,
Oid sortop, Oid collation,
Datum *min, Datum *max);
@@ -213,6 +215,8 @@ static bool get_actual_variable_endpoint(Relation heapRel,
MemoryContext outercontext,
Datum *endpointDatum);
static RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
+static double btcost_correlation(IndexOptInfo *index,
+ VariableStatData *vardata);
/*
@@ -5731,6 +5735,92 @@ examine_simple_variable(PlannerInfo *root, Var *var,
}
}
+/*
+ * examine_indexcol_variable
+ * Try to look up statistical data about an index column/expression.
+ * Fill in a VariableStatData struct to describe the column.
+ *
+ * Inputs:
+ * root: the planner info
+ * index: the index whose column we're interested in
+ * indexcol: 0-based index column number (subscripts index->indexkeys[])
+ *
+ * Outputs: *vardata is filled as follows:
+ * var: the input expression (with any binary relabeling stripped, if
+ * it is or contains a variable; but otherwise the type is preserved)
+ * rel: RelOptInfo for table relation containing variable.
+ * statsTuple: the pg_statistic entry for the variable, if one exists;
+ * otherwise NULL.
+ * freefunc: pointer to a function to release statsTuple with.
+ *
+ * Caller is responsible for doing ReleaseVariableStats() before exiting.
+ */
+static void
+examine_indexcol_variable(PlannerInfo *root, IndexOptInfo *index,
+ int indexcol, VariableStatData *vardata)
+{
+ AttrNumber colnum;
+ Oid relid;
+
+ if (index->indexkeys[indexcol] != 0)
+ {
+ /* Simple variable --- look to stats for the underlying table */
+ RangeTblEntry *rte = planner_rt_fetch(index->rel->relid, root);
+
+ Assert(rte->rtekind == RTE_RELATION);
+ relid = rte->relid;
+ Assert(relid != InvalidOid);
+ colnum = index->indexkeys[indexcol];
+ vardata->rel = index->rel;
+
+ if (get_relation_stats_hook &&
+ (*get_relation_stats_hook) (root, rte, colnum, vardata))
+ {
+ /*
+ * The hook took control of acquiring a stats tuple. If it did
+ * supply a tuple, it'd better have supplied a freefunc.
+ */
+ if (HeapTupleIsValid(vardata->statsTuple) &&
+ !vardata->freefunc)
+ elog(ERROR, "no function provided to release variable stats with");
+ }
+ else
+ {
+ vardata->statsTuple = SearchSysCache3(STATRELATTINH,
+ ObjectIdGetDatum(relid),
+ Int16GetDatum(colnum),
+ BoolGetDatum(rte->inh));
+ vardata->freefunc = ReleaseSysCache;
+ }
+ }
+ else
+ {
+ /* Expression --- maybe there are stats for the index itself */
+ relid = index->indexoid;
+ colnum = indexcol + 1;
+
+ if (get_index_stats_hook &&
+ (*get_index_stats_hook) (root, relid, colnum, vardata))
+ {
+ /*
+ * The hook took control of acquiring a stats tuple. If it did
+ * supply a tuple, it'd better have supplied a freefunc.
+ */
+ if (HeapTupleIsValid(vardata->statsTuple) &&
+ !vardata->freefunc)
+ elog(ERROR, "no function provided to release variable stats with");
+ }
+ else
+ {
+ vardata->statsTuple = SearchSysCache3(STATRELATTINH,
+ ObjectIdGetDatum(relid),
+ Int16GetDatum(colnum),
+ BoolGetDatum(false));
+ vardata->freefunc = ReleaseSysCache;
+ }
+ }
+}
+
/*
* Check whether it is permitted to call func_oid passing some of the
* pg_statistic data in vardata. We allow this either if the user has SELECT
@@ -6789,6 +6879,54 @@ add_predicate_to_index_quals(IndexOptInfo *index, List *indexQuals)
return list_concat(predExtraQuals, indexQuals);
}
+/*
+ * Estimate correlation of btree index's first column.
+ *
+ * If we can get an estimate of the first column's ordering correlation C
+ * from pg_statistic, estimate the index correlation as C for a
+ * single-column index, or C * 0.75 for multiple columns. (The idea here
+ * is that multiple columns dilute the importance of the first column's
+ * ordering, but don't negate it entirely. Before 8.0 we divided the
+ * correlation by the number of columns, but that seems too strong.)
+ *
+ * We already filled in the stats tuple for *vardata when called.
+ */
+static double
+btcost_correlation(IndexOptInfo *index, VariableStatData *vardata)
+{
+ Oid sortop;
+ AttStatsSlot sslot;
+ double indexCorrelation = 0;
+
+ Assert(HeapTupleIsValid(vardata->statsTuple));
+
+ sortop = get_opfamily_member(index->opfamily[0],
+ index->opcintype[0],
+ index->opcintype[0],
+ BTLessStrategyNumber);
+ if (OidIsValid(sortop) &&
+ get_attstatsslot(&sslot, vardata->statsTuple,
+ STATISTIC_KIND_CORRELATION, sortop,
+ ATTSTATSSLOT_NUMBERS))
+ {
+ double varCorrelation;
+
+ Assert(sslot.nnumbers == 1);
+ varCorrelation = sslot.numbers[0];
+
+ if (index->reverse_sort[0])
+ varCorrelation = -varCorrelation;
+
+ if (index->nkeycolumns > 1)
+ indexCorrelation = varCorrelation * 0.75;
+ else
+ indexCorrelation = varCorrelation;
+
+ free_attstatsslot(&sslot);
+ }
+
+ return indexCorrelation;
+}
void
btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
@@ -6798,17 +6936,21 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
{
IndexOptInfo *index = path->indexinfo;
GenericCosts costs = {0};
- Oid relid;
- AttrNumber colnum;
VariableStatData vardata = {0};
double numIndexTuples;
Cost descentCost;
List *indexBoundQuals;
int indexcol;
bool eqQualHere;
+ bool upperInequalHere;
+ bool lowerInequalHere;
+ bool have_correlation = false;
+ bool found_skip;
bool found_saop;
bool found_is_null_op;
+ double inequalselectivity = 1.0;
double num_sa_scans;
+ double correlation = 0;
ListCell *lc;
/*
@@ -6824,13 +6966,17 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* For a RowCompareExpr, we consider only the first column, just as
* rowcomparesel() does.
*
- * If there's a ScalarArrayOpExpr in the quals, we'll actually perform up
- * to N index descents (not just one), but the ScalarArrayOpExpr's
- * operator can be considered to act the same as it normally does.
+ * If there's a ScalarArrayOpExpr in the quals, or if we expect to
+ * generate a skip scan array, then we'll actually perform up to N index
+ * descents (not just one), but the underlying operator can be considered
+ * to act the same as it normally does.
*/
indexBoundQuals = NIL;
indexcol = 0;
eqQualHere = false;
+ upperInequalHere = false;
+ lowerInequalHere = false;
+ found_skip = false;
found_saop = false;
found_is_null_op = false;
num_sa_scans = 1;
@@ -6841,13 +6987,81 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
if (indexcol != iclause->indexcol)
{
+ bool first = true;
+
/* Beginning of a new column's quals */
- if (!eqQualHere)
- break; /* done if no '=' qual for indexcol */
+ if (eqQualHere)
+ indexcol++; /* don't skip the previous '=' qual's column */
+
+ /*
+ * Now estimate number of "array elements" using ndistinct.
+ *
+ * Internally, nbtree treats skip scans as scans with SAOP style
+ * arrays that generate elements procedurally. We effectively
+ * assume a "col = ANY('{every possible col value}')" qual.
+ */
+ while (indexcol < iclause->indexcol)
+ {
+ double ndistinct = DEFAULT_NUM_DISTINCT;
+ bool isdefault = true;
+
+ /* Obtain ndistinct for index column/indexed expression */
+ examine_indexcol_variable(root, index, indexcol, &vardata);
+ if (HeapTupleIsValid(vardata.statsTuple))
+ {
+ ndistinct = get_variable_numdistinct(&vardata, &isdefault);
+
+ if (indexcol == 0)
+ {
+ /*
+ * Get an estimate of the leading column's correlation
+ * in passing (avoids rereading variable stats below)
+ */
+ correlation = btcost_correlation(index, &vardata);
+ have_correlation = true;
+ }
+ }
+
+ ReleaseVariableStats(vardata);
+
+ if (first)
+ {
+ first = false;
+
+ /*
+ * Apply the selectivities of any inequalities to
+ * ndistinct (unless ndistinct is only a default estimate)
+ */
+ if (!isdefault)
+ ndistinct *= inequalselectivity;
+
+ /*
+ * Skip scan will likely require an initial index descent
+ * to find out what the real first element is..
+ */
+ if (!upperInequalHere)
+ ndistinct += 1;
+
+ /*
+ * ...and another extra descent to confirm no further
+ * groupings/matches
+ */
+ if (!lowerInequalHere)
+ ndistinct += 1;
+ }
+
+ num_sa_scans *= ndistinct;
+
+ /* Done costing skipping for this index column */
+ indexcol++;
+ found_skip = true;
+ }
+
+ /* new index column resets tracking variables */
eqQualHere = false;
- indexcol++;
- if (indexcol != iclause->indexcol)
- break; /* no quals at all for indexcol */
+ upperInequalHere = false;
+ lowerInequalHere = false;
+ inequalselectivity = 1.0;
}
/* Examine each indexqual associated with this index clause */
@@ -6889,7 +7103,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
if (nt->nulltesttype == IS_NULL)
{
found_is_null_op = true;
- /* IS NULL is like = for selectivity purposes */
+ /* IS NULL is like = for selectivity/skipping purposes */
eqQualHere = true;
}
}
@@ -6905,6 +7119,38 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
Assert(op_strategy != 0); /* not a member of opfamily?? */
if (op_strategy == BTEqualStrategyNumber)
eqQualHere = true;
+ else if (rinfo->norm_selec >= 0)
+ {
+ bool useinequal = true;
+
+ /*
+ * Skip scan requires tracking inequality selectivities to
+ * compute an adjusted whole-column ndistinct
+ */
+ if (op_strategy < BTEqualStrategyNumber)
+ {
+ if (upperInequalHere)
+ useinequal = false;
+ upperInequalHere = true;
+ }
+ else
+ {
+ if (lowerInequalHere)
+ useinequal = false;
+ lowerInequalHere = true;
+ }
+
+ /*
+ * Assume inequality selectivities are _not_ independent,
+ * but only track up to one upper bound inequality and up
+ * to one lower bound inequality. This avoids wildly
+ * wrong estimates given redundant operators.
+ */
+ if (useinequal)
+ inequalselectivity =
+ Max(inequalselectivity - (1.0 - rinfo->norm_selec),
+ DEFAULT_RANGE_INEQ_SEL);
+ }
}
indexBoundQuals = lappend(indexBoundQuals, rinfo);
@@ -6920,6 +7166,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
if (index->unique &&
indexcol == index->nkeycolumns - 1 &&
eqQualHere &&
+ !found_skip &&
!found_saop &&
!found_is_null_op)
numIndexTuples = 1.0;
@@ -7028,104 +7275,19 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
costs.indexStartupCost += descentCost;
costs.indexTotalCost += costs.num_sa_scans * descentCost;
- /*
- * If we can get an estimate of the first column's ordering correlation C
- * from pg_statistic, estimate the index correlation as C for a
- * single-column index, or C * 0.75 for multiple columns. (The idea here
- * is that multiple columns dilute the importance of the first column's
- * ordering, but don't negate it entirely. Before 8.0 we divided the
- * correlation by the number of columns, but that seems too strong.)
- */
- if (index->indexkeys[0] != 0)
+ if (!have_correlation)
{
- /* Simple variable --- look to stats for the underlying table */
- RangeTblEntry *rte = planner_rt_fetch(index->rel->relid, root);
-
- Assert(rte->rtekind == RTE_RELATION);
- relid = rte->relid;
- Assert(relid != InvalidOid);
- colnum = index->indexkeys[0];
-
- if (get_relation_stats_hook &&
- (*get_relation_stats_hook) (root, rte, colnum, &vardata))
- {
- /*
- * The hook took control of acquiring a stats tuple. If it did
- * supply a tuple, it'd better have supplied a freefunc.
- */
- if (HeapTupleIsValid(vardata.statsTuple) &&
- !vardata.freefunc)
- elog(ERROR, "no function provided to release variable stats with");
- }
- else
- {
- vardata.statsTuple = SearchSysCache3(STATRELATTINH,
- ObjectIdGetDatum(relid),
- Int16GetDatum(colnum),
- BoolGetDatum(rte->inh));
- vardata.freefunc = ReleaseSysCache;
- }
+ examine_indexcol_variable(root, index, 0, &vardata);
+ if (HeapTupleIsValid(vardata.statsTuple))
+ costs.indexCorrelation = btcost_correlation(index, &vardata);
+ ReleaseVariableStats(vardata);
}
else
{
- /* Expression --- maybe there are stats for the index itself */
- relid = index->indexoid;
- colnum = 1;
-
- if (get_index_stats_hook &&
- (*get_index_stats_hook) (root, relid, colnum, &vardata))
- {
- /*
- * The hook took control of acquiring a stats tuple. If it did
- * supply a tuple, it'd better have supplied a freefunc.
- */
- if (HeapTupleIsValid(vardata.statsTuple) &&
- !vardata.freefunc)
- elog(ERROR, "no function provided to release variable stats with");
- }
- else
- {
- vardata.statsTuple = SearchSysCache3(STATRELATTINH,
- ObjectIdGetDatum(relid),
- Int16GetDatum(colnum),
- BoolGetDatum(false));
- vardata.freefunc = ReleaseSysCache;
- }
+ /* btcost_correlation was already called earlier */
+ costs.indexCorrelation = correlation;
}
- if (HeapTupleIsValid(vardata.statsTuple))
- {
- Oid sortop;
- AttStatsSlot sslot;
-
- sortop = get_opfamily_member(index->opfamily[0],
- index->opcintype[0],
- index->opcintype[0],
- BTLessStrategyNumber);
- if (OidIsValid(sortop) &&
- get_attstatsslot(&sslot, vardata.statsTuple,
- STATISTIC_KIND_CORRELATION, sortop,
- ATTSTATSSLOT_NUMBERS))
- {
- double varCorrelation;
-
- Assert(sslot.nnumbers == 1);
- varCorrelation = sslot.numbers[0];
-
- if (index->reverse_sort[0])
- varCorrelation = -varCorrelation;
-
- if (index->nkeycolumns > 1)
- costs.indexCorrelation = varCorrelation * 0.75;
- else
- costs.indexCorrelation = varCorrelation;
-
- free_attstatsslot(&sslot);
- }
- }
-
- ReleaseVariableStats(vardata);
-
*indexStartupCost = costs.indexStartupCost;
*indexTotalCost = costs.indexTotalCost;
*indexSelectivity = costs.indexSelectivity;
diff --git a/src/backend/utils/adt/skipsupport.c b/src/backend/utils/adt/skipsupport.c
new file mode 100644
index 000000000..d91471e26
--- /dev/null
+++ b/src/backend/utils/adt/skipsupport.c
@@ -0,0 +1,60 @@
+/*-------------------------------------------------------------------------
+ *
+ * skipsupport.c
+ * Support routines for B-Tree skip scan.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/adt/skipsupport.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "utils/lsyscache.h"
+#include "utils/skipsupport.h"
+
+/*
+ * Fill in SkipSupport given an operator class (opfamily + opcintype).
+ *
+ * On success, returns true, and initializes all SkipSupport fields for
+ * caller. Otherwise returns false, indicating that operator class has no
+ * skip support function.
+ */
+bool
+PrepareSkipSupportFromOpclass(Oid opfamily, Oid opcintype, bool reverse,
+ SkipSupport sksup)
+{
+ Oid skipSupportFunction;
+
+ /* Look for a skip support function */
+ skipSupportFunction = get_opfamily_proc(opfamily, opcintype, opcintype,
+ BTSKIPSUPPORT_PROC);
+ if (!OidIsValid(skipSupportFunction))
+ return false;
+
+ OidFunctionCall1(skipSupportFunction, PointerGetDatum(sksup));
+
+ if (reverse)
+ {
+ /*
+ * DESC/reverse case: swap low_elem with high_elem, and swap decrement
+ * with increment
+ */
+ Datum low_elem = sksup->low_elem;
+ SkipSupportIncDec decrement = sksup->decrement;
+
+ sksup->low_elem = sksup->high_elem;
+ sksup->decrement = sksup->increment;
+
+ sksup->high_elem = low_elem;
+ sksup->increment = decrement;
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/adt/uuid.c b/src/backend/utils/adt/uuid.c
index 5284d23dc..81c3494ea 100644
--- a/src/backend/utils/adt/uuid.c
+++ b/src/backend/utils/adt/uuid.c
@@ -13,12 +13,15 @@
#include "postgres.h"
+#include <limits.h>
+
#include "common/hashfn.h"
#include "lib/hyperloglog.h"
#include "libpq/pqformat.h"
#include "port/pg_bswap.h"
#include "utils/fmgrprotos.h"
#include "utils/guc.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
#include "utils/timestamp.h"
#include "utils/uuid.h"
@@ -384,6 +387,70 @@ uuid_abbrev_convert(Datum original, SortSupport ssup)
return res;
}
+static Datum
+uuid_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ pg_uuid_t *uuid;
+
+ uuid = (pg_uuid_t *) palloc(UUID_LEN);
+ memcpy(uuid, DatumGetUUIDP(existing), UUID_LEN);
+ *underflow = false;
+ for (int i = UUID_LEN - 1; i >= 0; i--)
+ {
+ if (uuid->data[i] > 0)
+ {
+ uuid->data[i]--;
+ return UUIDPGetDatum(uuid);
+ }
+ uuid->data[i] = UCHAR_MAX;
+ }
+
+ *underflow = true;
+
+ return 0;
+}
+
+static Datum
+uuid_increment(Relation rel, Datum existing, bool *overflow)
+{
+ pg_uuid_t *uuid;
+
+ uuid = (pg_uuid_t *) palloc(UUID_LEN);
+ memcpy(uuid, DatumGetUUIDP(existing), UUID_LEN);
+ *overflow = false;
+ for (int i = UUID_LEN - 1; i >= 0; i--)
+ {
+ if (uuid->data[i] < UCHAR_MAX)
+ {
+ uuid->data[i]++;
+ return UUIDPGetDatum(uuid);
+ }
+ uuid->data[i] = 0;
+ }
+
+ *overflow = true;
+
+ return 0;
+}
+
+Datum
+uuid_skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+ pg_uuid_t *uuid_min = palloc(UUID_LEN);
+ pg_uuid_t *uuid_max = palloc(UUID_LEN);
+
+ memset(uuid_min->data, 0x00, UUID_LEN);
+ memset(uuid_max->data, 0xFF, UUID_LEN);
+
+ sksup->decrement = uuid_decrement;
+ sksup->increment = uuid_increment;
+ sksup->low_elem = UUIDPGetDatum(uuid_min);
+ sksup->high_elem = UUIDPGetDatum(uuid_max);
+
+ PG_RETURN_VOID();
+}
+
/* hash index support */
Datum
uuid_hash(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 521ec5591..239baa7f3 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -28,6 +28,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/nbtree.h"
#include "access/slru.h"
#include "access/toast_compression.h"
#include "access/twophase.h"
@@ -1751,6 +1752,17 @@ struct config_bool ConfigureNamesBool[] =
},
#endif
+ /* XXX Remove before commit */
+ {
+ {"skipscan_skipsupport_enabled", PGC_SUSET, DEVELOPER_OPTIONS,
+ NULL, NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &skipscan_skipsupport_enabled,
+ true,
+ NULL, NULL, NULL
+ },
+
{
{"integer_datetimes", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows whether datetimes are integer based."),
@@ -3590,6 +3602,17 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ /* XXX Remove before commit */
+ {
+ {"skipscan_prefix_cols", PGC_SUSET, DEVELOPER_OPTIONS,
+ NULL, NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &skipscan_prefix_cols,
+ INDEX_MAX_KEYS, 0, INDEX_MAX_KEYS,
+ NULL, NULL, NULL
+ },
+
{
/* Can't be set in postgresql.conf */
{"server_version_num", PGC_INTERNAL, PRESET_OPTIONS,
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 2b3997988..9662fb2ba 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -583,6 +583,19 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
</para>
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><function>skipsupport</function></term>
+ <listitem>
+ <para>
+ Optionally, a btree operator family may provide a <firstterm>skip
+ support</firstterm> function, registered under support function
+ number 6. These functions allow the B-tree code to more efficiently
+ navigate the index structure via an index <quote>skip scan</quote>. The
+ APIs involved in this are defined in
+ <filename>src/include/utils/skipsupport.h</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</sect2>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index e3c1539a1..4c586bc8a 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -809,7 +809,8 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
Size
-amestimateparallelscan (int nkeys,
+amestimateparallelscan (Relation indexRelation,
+ int nkeys,
int norderbys);
</programlisting>
Estimate and return the number of bytes of dynamic shared memory which
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 6d731e070..433e108b8 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -457,23 +457,26 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
<para>
A multicolumn B-tree index can be used with query conditions that
involve any subset of the index's columns, but the index is most
- efficient when there are constraints on the leading (leftmost) columns.
- The exact rule is that equality constraints on leading columns, plus
- any inequality constraints on the first column that does not have an
- equality constraint, will be used to limit the portion of the index
- that is scanned. Constraints on columns to the right of these columns
- are checked in the index, so they save visits to the table proper, but
- they do not reduce the portion of the index that has to be scanned.
+ efficient when there are equality constraints on the leading (leftmost) columns.
+ B-Tree index scans can use the index skip scan strategy to generate
+ equality constraints on prefix columns that were wholly omitted from the
+ query predicate, as well as prefix columns whose values were constrained by
+ inequality conditions.
For example, given an index on <literal>(a, b, c)</literal> and a
query condition <literal>WHERE a = 5 AND b >= 42 AND c < 77</literal>,
the index would have to be scanned from the first entry with
<literal>a</literal> = 5 and <literal>b</literal> = 42 up through the last entry with
- <literal>a</literal> = 5. Index entries with <literal>c</literal> >= 77 would be
- skipped, but they'd still have to be scanned through.
+ <literal>a</literal> = 5. Intervening groups of index entries with
+ <literal>c</literal> >= 77 would not need to be returned by the scan,
+ and can be skipped over entirely by applying the skip scan strategy.
This index could in principle be used for queries that have constraints
on <literal>b</literal> and/or <literal>c</literal> with no constraint on <literal>a</literal>
- — but the entire index would have to be scanned, so in most cases
- the planner would prefer a sequential table scan over using the index.
+ — but that approach is generally only taken when there are so few
+ distinct <literal>a</literal> values that the planner expects the skip scan
+ strategy to allow the scan to skip over most individual index leaf pages.
+ If there are many distinct <literal>a</literal> values, then the entire
+ index will have to be scanned, so in most cases the planner will prefer a
+ sequential table scan over using the index.
</para>
<para>
@@ -508,10 +511,7 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
</para>
<para>
- Multicolumn indexes should be used sparingly. In most situations,
- an index on a single column is sufficient and saves space and time.
- Indexes with more than three columns are unlikely to be helpful
- unless the usage of the table is extremely stylized. See also
+ Multicolumn indexes should be used judiciously. See
<xref linkend="indexes-bitmap-scans"/> and
<xref linkend="indexes-index-only-scans"/> for some discussion of the
merits of different index configurations.
@@ -669,9 +669,13 @@ CREATE INDEX test3_desc_index ON test3 (id DESC NULLS LAST);
multicolumn index on <literal>(x, y)</literal>. This index would typically be
more efficient than index combination for queries involving both
columns, but as discussed in <xref linkend="indexes-multicolumn"/>, it
- would be almost useless for queries involving only <literal>y</literal>, so it
- should not be the only index. A combination of the multicolumn index
- and a separate index on <literal>y</literal> would serve reasonably well. For
+ would be less useful for queries involving only <literal>y</literal>. Just
+ how useful might depend on how effective the B-Tree index skip scan
+ optimization is; if <literal>x</literal> has no more than several hundred
+ distinct values, skip scan will make searches for specific
+ <literal>y</literal> values execute reasonably efficiently. A combination
+ of a multicolumn index on <literal>(x, y)</literal> and a separate index on
+ <literal>y</literal> might also serve reasonably well. For
queries involving only <literal>x</literal>, the multicolumn index could be
used, though it would be larger and hence slower than an index on
<literal>x</literal> alone. The last alternative is to create all three
diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index 22d8ad1aa..63f03f3a7 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -461,6 +461,13 @@
</entry>
<entry>5</entry>
</row>
+ <row>
+ <entry>
+ Return the addresses of C-callable skip support function(s)
+ (optional)
+ </entry>
+ <entry>6</entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -1056,7 +1063,8 @@ DEFAULT FOR TYPE int8 USING btree FAMILY integer_ops AS
FUNCTION 1 btint8cmp(int8, int8) ,
FUNCTION 2 btint8sortsupport(internal) ,
FUNCTION 3 in_range(int8, int8, int8, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint8skipsupport(internal) ;
CREATE OPERATOR CLASS int4_ops
DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
@@ -1069,7 +1077,8 @@ DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
FUNCTION 1 btint4cmp(int4, int4) ,
FUNCTION 2 btint4sortsupport(internal) ,
FUNCTION 3 in_range(int4, int4, int4, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint4skipsupport(internal) ;
CREATE OPERATOR CLASS int2_ops
DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
@@ -1082,7 +1091,8 @@ DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
FUNCTION 1 btint2cmp(int2, int2) ,
FUNCTION 2 btint2sortsupport(internal) ,
FUNCTION 3 in_range(int2, int2, int2, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint2skipsupport(internal) ;
ALTER OPERATOR FAMILY integer_ops USING btree ADD
-- cross-type comparisons int8 vs int2
diff --git a/src/test/regress/expected/alter_generic.out b/src/test/regress/expected/alter_generic.out
index ae54cb254..8b6b775c1 100644
--- a/src/test/regress/expected/alter_generic.out
+++ b/src/test/regress/expected/alter_generic.out
@@ -362,9 +362,9 @@ ERROR: invalid operator number 0, must be between 1 and 5
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 1 < ; -- operator without argument types
ERROR: operator argument types must be specified in ALTER OPERATOR FAMILY
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 0 btint42cmp(int4, int2); -- invalid options parsing function
-ERROR: invalid function number 0, must be between 1 and 5
-ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 6 btint42cmp(int4, int2); -- function number should be between 1 and 5
-ERROR: invalid function number 6, must be between 1 and 5
+ERROR: invalid function number 0, must be between 1 and 6
+ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 7 btint42cmp(int4, int2); -- function number should be between 1 and 6
+ERROR: invalid function number 7, must be between 1 and 6
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD STORAGE invalid_storage; -- Ensure STORAGE is not a part of ALTER OPERATOR FAMILY
ERROR: STORAGE cannot be specified in ALTER OPERATOR FAMILY
DROP OPERATOR FAMILY alt_opf4 USING btree;
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index cf6eac573..f7b3ecef4 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1938,7 +1938,7 @@ ORDER BY unique1;
42
(3 rows)
--- Non-required array scan key on "tenthous":
+-- Skip array on "thousand", SAOP array on "tenthous":
explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -1958,7 +1958,7 @@ ORDER BY thousand;
1 | 1001
(2 rows)
--- Non-required array scan key on "tenthous", backward scan:
+-- Skip array on "thousand", SAOP array on "tenthous", backward scan:
explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 31fb7d142..8c2a939b0 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -4370,24 +4370,25 @@ select b.unique1 from
join int4_tbl i1 on b.thousand = f1
right join int4_tbl i2 on i2.f1 = b.tenthous
order by 1;
- QUERY PLAN
------------------------------------------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------------------
Sort
Sort Key: b.unique1
-> Nested Loop Left Join
-> Seq Scan on int4_tbl i2
- -> Nested Loop Left Join
- Join Filter: (b.unique1 = 42)
- -> Nested Loop
+ -> Nested Loop
+ Join Filter: (b.thousand = i1.f1)
+ -> Nested Loop Left Join
+ Join Filter: (b.unique1 = 42)
-> Nested Loop
- -> Seq Scan on int4_tbl i1
-> Index Scan using tenk1_thous_tenthous on tenk1 b
- Index Cond: ((thousand = i1.f1) AND (tenthous = i2.f1))
- -> Index Scan using tenk1_unique1 on tenk1 a
- Index Cond: (unique1 = b.unique2)
- -> Index Only Scan using tenk1_thous_tenthous on tenk1 c
- Index Cond: (thousand = a.thousand)
-(15 rows)
+ Index Cond: (tenthous = i2.f1)
+ -> Index Scan using tenk1_unique1 on tenk1 a
+ Index Cond: (unique1 = b.unique2)
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 c
+ Index Cond: (thousand = a.thousand)
+ -> Seq Scan on int4_tbl i1
+(16 rows)
select b.unique1 from
tenk1 a join tenk1 b on a.unique1 = b.unique2
@@ -7482,19 +7483,23 @@ select * from fkest f1
join fkest f2 on (f1.x = f2.x and f1.x10 = f2.x10b and f1.x100 = f2.x100)
join fkest f3 on f1.x = f3.x
where f1.x100 = 2;
- QUERY PLAN
------------------------------------------------------------
+ QUERY PLAN
+-------------------------------------------------------------------
Nested Loop
-> Hash Join
Hash Cond: ((f2.x = f1.x) AND (f2.x10b = f1.x10))
- -> Seq Scan on fkest f2
- Filter: (x100 = 2)
+ -> Bitmap Heap Scan on fkest f2
+ Recheck Cond: (x100 = 2)
+ -> Bitmap Index Scan on fkest_x_x10_x100_idx
+ Index Cond: (x100 = 2)
-> Hash
- -> Seq Scan on fkest f1
- Filter: (x100 = 2)
+ -> Bitmap Heap Scan on fkest f1
+ Recheck Cond: (x100 = 2)
+ -> Bitmap Index Scan on fkest_x_x10_x100_idx
+ Index Cond: (x100 = 2)
-> Index Scan using fkest_x_x10_x100_idx on fkest f3
Index Cond: (x = f1.x)
-(10 rows)
+(14 rows)
alter table fkest add constraint fk
foreign key (x, x10b, x100) references fkest (x, x10, x100);
@@ -7503,20 +7508,24 @@ select * from fkest f1
join fkest f2 on (f1.x = f2.x and f1.x10 = f2.x10b and f1.x100 = f2.x100)
join fkest f3 on f1.x = f3.x
where f1.x100 = 2;
- QUERY PLAN
------------------------------------------------------
+ QUERY PLAN
+-------------------------------------------------------------------
Hash Join
Hash Cond: ((f2.x = f1.x) AND (f2.x10b = f1.x10))
-> Hash Join
Hash Cond: (f3.x = f2.x)
-> Seq Scan on fkest f3
-> Hash
- -> Seq Scan on fkest f2
- Filter: (x100 = 2)
+ -> Bitmap Heap Scan on fkest f2
+ Recheck Cond: (x100 = 2)
+ -> Bitmap Index Scan on fkest_x_x10_x100_idx
+ Index Cond: (x100 = 2)
-> Hash
- -> Seq Scan on fkest f1
- Filter: (x100 = 2)
-(11 rows)
+ -> Bitmap Heap Scan on fkest f1
+ Recheck Cond: (x100 = 2)
+ -> Bitmap Index Scan on fkest_x_x10_x100_idx
+ Index Cond: (x100 = 2)
+(15 rows)
rollback;
--
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 6aeb7cb96..f4c696ca5 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5193,9 +5193,10 @@ List of access methods
btree | uuid_ops | uuid | uuid | 1 | uuid_cmp
btree | uuid_ops | uuid | uuid | 2 | uuid_sortsupport
btree | uuid_ops | uuid | uuid | 4 | btequalimage
+ btree | uuid_ops | uuid | uuid | 6 | uuid_skipsupport
hash | uuid_ops | uuid | uuid | 1 | uuid_hash
hash | uuid_ops | uuid | uuid | 2 | uuid_hash_extended
-(5 rows)
+(6 rows)
-- check \dconfig
set work_mem = 10240;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 0456d48c9..39aa1f89e 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1461,18 +1461,17 @@ select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
union all
(values(1)) limit 1;
- QUERY PLAN
---------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------
Limit
-> Append
-> Nested Loop
- Join Filter: (t1.tenthous = t2.tenthous)
- -> Seq Scan on tenk1 t1
- -> Materialize
- -> Seq Scan on tenk2 t2
- Filter: (thousand = 0)
+ -> Seq Scan on tenk2 t2
+ Filter: (thousand = 0)
+ -> Index Scan using tenk1_thous_tenthous on tenk1 t1
+ Index Cond: (tenthous = t2.tenthous)
-> Result
-(9 rows)
+(8 rows)
-- Ensure there is no problem if cheapest_startup_path is NULL
explain (costs off)
diff --git a/src/test/regress/sql/alter_generic.sql b/src/test/regress/sql/alter_generic.sql
index de58d268d..4246afefd 100644
--- a/src/test/regress/sql/alter_generic.sql
+++ b/src/test/regress/sql/alter_generic.sql
@@ -310,7 +310,7 @@ ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 6 < (int4, int2); -- ope
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 0 < (int4, int2); -- operator number should be between 1 and 5
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 1 < ; -- operator without argument types
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 0 btint42cmp(int4, int2); -- invalid options parsing function
-ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 6 btint42cmp(int4, int2); -- function number should be between 1 and 5
+ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 7 btint42cmp(int4, int2); -- function number should be between 1 and 6
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD STORAGE invalid_storage; -- Ensure STORAGE is not a part of ALTER OPERATOR FAMILY
DROP OPERATOR FAMILY alt_opf4 USING btree;
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index e296891ca..1d269dc30 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -766,7 +766,7 @@ SELECT unique1 FROM tenk1
WHERE unique1 IN (1,42,7)
ORDER BY unique1;
--- Non-required array scan key on "tenthous":
+-- Skip array on "thousand", SAOP array on "tenthous":
explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -776,7 +776,7 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
--- Non-required array scan key on "tenthous", backward scan:
+-- Skip array on "thousand", SAOP array on "tenthous", backward scan:
explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index df3f336be..18fec46e7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -218,6 +218,7 @@ BTScanPos
BTScanPosData
BTScanPosItem
BTShared
+BTSkipPreproc
BTSortArrayContext
BTSpool
BTStack
@@ -2662,6 +2663,8 @@ SimpleStringListCell
SingleBoundSortItem
Size
SkipPages
+SkipSupport
+SkipSupportData
SlabBlock
SlabContext
SlabSlot
--
2.45.2
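For readers skimming the btcostestimate hunk in the patch above, here is a minimal standalone sketch of the per-column arithmetic it appears to perform when costing a skip array. The function names and the SKETCH_ prefix are made up for illustration and are not part of the patch. The floor constant copies the value of DEFAULT_RANGE_INEQ_SEL (0.005) so that the sketch compiles on its own. In the patch, these adjustments are only applied to the first skipped column in a run of skipped columns.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Same value as PostgreSQL's DEFAULT_RANGE_INEQ_SEL (0.005), redefined here
 * so that the sketch stands alone.
 */
#define SKETCH_RANGE_INEQ_FLOOR 0.005

/*
 * Fold one inequality's selectivity into the running estimate.  The
 * selectivities are treated as overlapping rather than independent, so the
 * "rejected" fraction is subtracted and the result is clamped at the floor.
 */
static double
sketch_combine_ineq_sel(double running_sel, double norm_selec)
{
	double		combined = running_sel - (1.0 - norm_selec);

	return (combined > SKETCH_RANGE_INEQ_FLOOR) ? combined : SKETCH_RANGE_INEQ_FLOOR;
}

/*
 * Estimate how many "array elements" (index descents) one skipped column
 * contributes.  A default ndistinct guess is not scaled by the inequality
 * selectivity.  One extra descent is charged for each side of the range
 * that has no inequality bounding it.
 */
static double
sketch_skip_array_elements(double ndistinct, bool isdefault, double ineq_sel,
						   bool have_upper, bool have_lower)
{
	if (!isdefault)
		ndistinct *= ineq_sel;
	if (!have_upper)
		ndistinct += 1;
	if (!have_lower)
		ndistinct += 1;

	return ndistinct;
}
```

With 100 distinct values and no inequalities on the skipped column, the sketch charges 102 descents. The patch multiplies num_sa_scans by that figure for each skipped column.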
Hi,
I started looking at this patch today. The first thing I usually do for
new patches is a stress test, so I wrote a simple script that generates a
random table and runs a random query with an IN() clause under various
configs (parallel query, index-only scans, ...). And it got stuck on a
parallel query pretty quickly.
I've seen a bunch of those cases, so it's not a particularly unlikely
issue. The backtraces look pretty much the same in all cases - the
processes are stuck either waiting on the condition variable in
_bt_parallel_seize, or trying to send data in shm_mq_send_bytes.
Attached is the script I use for stress testing (pretty dumb, just a
bunch of loops generating tables + queries), and backtraces for two
lockups (one is EXPLAIN ANALYZE, but otherwise exactly the same).
I haven't investigated why this is happening, but I wonder if this might
be similar to the parallel hashjoin issues, with trying to send data,
but the receiver being unable to proceed and effectively waiting on the
sender. But that's just a wild guess.
regards
--
Tomas Vondra
On Sat, Sep 7, 2024 at 11:27 AM Tomas Vondra <tomas@vondra.me> wrote:
I started looking at this patch today.
Thanks for taking a look!
The first thing I usually do for
new patches is a stress test, so I did a simple script that generates
random table and runs a random query with IN() clause with various
configs (parallel query, index-only scans, ...). And it got stuck on a
parallel query pretty quick.
I can reproduce this locally, without too much difficulty.
Unfortunately, this is a bug on master/Postgres 17. Some kind of issue
in my commit 5bf748b8.
The timing of this is slightly unfortunate. There are only a few weeks
until the release of 17, plus I have to travel for work over the next
week. I won't be back until the 16th, and will have limited
availability between now and then. I think that I'll have ample time
to debug and fix the issue ahead of the release of 17, though.
Looks like the problem is a parallel index scan with SAOP array keys
can find itself in a state where every parallel worker waits for the
leader to finish off a scheduled primitive index scan, while the
leader itself waits for the scan's tuple queue to return more tuples.
Obviously, the query will effectively go to sleep indefinitely when
that happens (unless and until the DBA cancels the query). This is
only possible with just the right/wrong combination of array keys and
index cardinality.
I cannot recreate the problem with parallel_leader_participation=off,
which strongly suggests that leader participation is a factor. I'll
find time to study this in detail as soon as I can.
Further background: I was always aware of the leader's tendency to go
away forever shortly after the scan begins. That was supposed to be
safe, since we account for it by serializing the scan's current array
keys in shared memory, at the point a primitive index scan is
scheduled -- any backend should be able to pick up where any other
backend left off, no matter how primitive scans are scheduled. That
now doesn't seem to be completely robust, likely due to restrictions
on when and how other backends can pick up the scheduled work from
within _bt_first, at the point that it calls _bt_parallel_seize.
In short, one or two details of how backends call _bt_parallel_seize
to pick up BTPARALLEL_NEED_PRIMSCAN work likely need to be rethought.
--
Peter Geoghegan
On Mon, 9 Sept 2024 at 21:55, Peter Geoghegan <pg@bowt.ie> wrote:
On Sat, Sep 7, 2024 at 11:27 AM Tomas Vondra <tomas@vondra.me> wrote:
I started looking at this patch today.
Thanks for taking a look!
The first thing I usually do for
new patches is a stress test, so I did a simple script that generates
random table and runs a random query with IN() clause with various
configs (parallel query, index-only scans, ...). And it got stuck on a
parallel query pretty quick.
I can reproduce this locally, without too much difficulty.
Unfortunately, this is a bug on master/Postgres 17. Some kind of issue
in my commit 5bf748b8.
[...]
In short, one or two details of how backends call _bt_parallel_seize
to pick up BTPARALLEL_NEED_PRIMSCAN work likely need to be rethought.
Thanks to Peter for the description, that helped me debug the issue. I
think I found a fix for the issue: regression tests for 811af978
consistently got stuck on my macbook before the attached patch 0001,
after applying that patch they completed just fine.
The issue to me seems to be the following:
Only _bt_first can start a new primitive scan, so _bt_parallel_seize
only assigns a new primscan if the process is indeed in _bt_first (as
provided with _bt_parallel_seize(first=true)). All other backends that hit a
NEED_PRIMSCAN state will currently pause until a backend in _bt_first
does the next primitive scan.
A backend that hasn't requested the next primitive scan will likely
hit _bt_parallel_seize from code other than _bt_first, thus pausing.
If this is the leader process, it'll stop consuming tuples from
follower processes.
If the follower process finds a new primitive scan is required after
finishing reading results from a page, it will first request a new
primitive scan, and only then start producing the tuples.
As such, we can have a follower process that just finished reading a
page, has issued a new primitive scan, and now tries to send tuples to
its leader process before getting back to _bt_first, but its
leader process won't acknowledge any tuples because it's waiting for
that process to start the next primitive scan - now we're deadlocked.
---
The fix in 0001 is relatively simple: we stop backends from waiting
for a concurrent backend to resolve the NEED_PRIMSCAN condition, and
instead move our local state machine so that we'll hit _bt_first
ourselves, so that we may be able to start the next primitive scan.
Also attached is 0002, which adds tracking of responsible backends to
parallel btree scans, thus allowing us to assert we're never waiting
for our own process to move the state forward. I found this patch
helpful while working on solving this issue, even if it wouldn't have
found the bug as reported.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Attachments:
v1-0001-Fix-stuck-parallel-btree-scans.patch (application/octet-stream)
From db0f4800d3bae875b5b1b262249c12738a243bf7 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 12 Sep 2024 15:23:20 +0100
Subject: [PATCH v1 1/2] Fix stuck parallel btree scans
Before, a backend that called _bt_parallel_seize was not always
guaranteed to be able to move forward on a state where more work
was expected from parallel backends, and handled NEED_PRIMSCAN as
a semi-ADVANCING state. This caused issues when the leader process
was waiting for the state to advance and concurrent backends were
waiting for the leader to consume the buffered tuples they still
had after updating the state to NEED_PRIMSCAN.
This is fixed by treating _bt_parallel_seize()'s status output as
the status of a currently active primitive scan. If _seize is
called from outside _bt_first, and the scan state is NEED_PRIMSCAN,
then we'll end our current primitive scan and set the scan up for
a new primitive scan, eventually hitting _bt_first's call to
_seize.
---
src/backend/access/nbtree/nbtree.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 6d090f8739..2b553d1161 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -584,7 +584,8 @@ btparallelrescan(IndexScanDesc scan)
* or _bt_parallel_done().
*
* The return value is true if we successfully seized the scan and false
- * if we did not. The latter case occurs if no pages remain.
+ * if we did not. The latter case occurs if no pages remain in this primitive
+ * index scan.
*
* If the return value is true, *pageno returns the next or current page
* of the scan (depending on the scan direction). An invalid block number
@@ -653,8 +654,10 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
Assert(so->numArrayKeys);
/*
- * If we can start another primitive scan right away, do so.
- * Otherwise just wait.
+ * If we're called from _bt_first and thus are set up to start a
+ * primitive scan, do so. If not, we stop this current primitive
+ * scan by returning false, which sets us up for the call to
+ * _bt_first which can then try to seize this scan again.
*/
if (first)
{
@@ -672,6 +675,13 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
*pageno = InvalidBlockNumber;
exit_loop = true;
}
+ else
+ {
+ so->needPrimScan = true;
+ so->scanBehind = false;
+ *pageno = InvalidBlockNumber;
+ status = false;
+ }
}
else if (btscan->btps_pageStatus != BTPARALLEL_ADVANCING)
{
--
2.46.0
v1-0002-nbtree-add-tracking-of-processing-responsibilitie.patch (application/octet-stream)
From e04836e0e1c822f586778d489c8b7ea6708feec5 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 12 Sep 2024 15:27:02 +0100
Subject: [PATCH v1 2/2] nbtree: add tracking of processing responsibilities in
BTPSD
By tracking which proc is responsible for moving the state forward, we can
make assertions about the scan moving forward, and also assign blame to a
specific backend when we still get stuck.
---
src/backend/access/nbtree/nbtree.c | 36 ++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 2b553d1161..0324860451 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -72,6 +72,10 @@ typedef struct BTParallelScanDescData
* possible states of parallel scan. */
slock_t btps_mutex; /* protects above variables, btps_arrElems */
ConditionVariable btps_cv; /* used to synchronize parallel scan */
+#ifdef USE_ASSERT_CHECKING
+ ProcNumber btps_procnumber; /* procnumber of backend currently
+ * advancing the scan */
+#endif
/*
* btps_arrElems is used when scans need to schedule another primitive
@@ -550,6 +554,9 @@ btinitparallelscan(void *target)
SpinLockInit(&bt_target->btps_mutex);
bt_target->btps_scanPage = InvalidBlockNumber;
bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+#if USE_ASSERT_CHECKING
+ bt_target->btps_procnumber = INVALID_PROC_NUMBER;
+#endif
ConditionVariableInit(&bt_target->btps_cv);
}
@@ -575,6 +582,9 @@ btparallelrescan(IndexScanDesc scan)
SpinLockAcquire(&btscan->btps_mutex);
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+#if USE_ASSERT_CHECKING
+ btscan->btps_procnumber = INVALID_PROC_NUMBER;
+#endif
SpinLockRelease(&btscan->btps_mutex);
}
@@ -642,6 +652,9 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
while (1)
{
+#ifdef USE_ASSERT_CHECKING
+ ProcNumber waitingFor;
+#endif
SpinLockAcquire(&btscan->btps_mutex);
if (btscan->btps_pageStatus == BTPARALLEL_DONE)
@@ -674,6 +687,9 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
so->scanBehind = false;
*pageno = InvalidBlockNumber;
exit_loop = true;
+#ifdef USE_ASSERT_CHECKING
+ btscan->btps_procnumber = MyProcNumber;
+#endif
}
else
{
@@ -690,12 +706,20 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
* of advancing it to a new page!
*/
btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
+#ifdef USE_ASSERT_CHECKING
+ btscan->btps_procnumber = MyProcNumber;
+#endif
*pageno = btscan->btps_scanPage;
exit_loop = true;
}
+#ifdef USE_ASSERT_CHECKING
+ waitingFor = btscan->btps_procnumber;
+#endif
SpinLockRelease(&btscan->btps_mutex);
if (exit_loop || !status)
break;
+
+ Assert(waitingFor != MyProcNumber && waitingFor != INVALID_PROC_NUMBER);
ConditionVariableSleep(&btscan->btps_cv, WAIT_EVENT_BTREE_PAGE);
}
ConditionVariableCancelSleep();
@@ -726,6 +750,10 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
SpinLockAcquire(&btscan->btps_mutex);
btscan->btps_scanPage = scan_page;
btscan->btps_pageStatus = BTPARALLEL_IDLE;
+#if USE_ASSERT_CHECKING
+ Assert(btscan->btps_procnumber == MyProcNumber);
+ btscan->btps_procnumber = INVALID_PROC_NUMBER;
+#endif
SpinLockRelease(&btscan->btps_mutex);
ConditionVariableSignal(&btscan->btps_cv);
}
@@ -758,6 +786,11 @@ _bt_parallel_done(IndexScanDesc scan)
SpinLockAcquire(&btscan->btps_mutex);
if (btscan->btps_pageStatus != BTPARALLEL_DONE)
{
+#if USE_ASSERT_CHECKING
+ Assert(btscan->btps_procnumber == MyProcNumber);
+ btscan->btps_procnumber = INVALID_PROC_NUMBER;
+#endif
+
btscan->btps_pageStatus = BTPARALLEL_DONE;
status_changed = true;
}
@@ -792,6 +825,9 @@ _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
if (btscan->btps_scanPage == prev_scan_page &&
btscan->btps_pageStatus == BTPARALLEL_IDLE)
{
+#ifdef USE_ASSERT_CHECKING
+ Assert(btscan->btps_procnumber == INVALID_PROC_NUMBER);
+#endif
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NEED_PRIMSCAN;
--
2.46.0
On 9/12/24 16:49, Matthias van de Meent wrote:
On Mon, 9 Sept 2024 at 21:55, Peter Geoghegan <pg@bowt.ie> wrote:
...
The fix in 0001 is relatively simple: we stop backends from waiting
for a concurrent backend to resolve the NEED_PRIMSCAN condition, and
instead move our local state machine so that we'll hit _bt_first
ourselves, so that we may be able to start the next primitive scan.
Also attached is 0002, which adds tracking of responsible backends to
parallel btree scans, thus allowing us to assert we're never waiting
for our own process to move the state forward. I found this patch
helpful while working on solving this issue, even if it wouldn't have
found the bug as reported.
No opinion on the analysis / coding, but per my testing the fix indeed
addresses the issue. The script reliably got stuck within a minute, now
it's running for ~1h just fine. It also checks results, and those seem
fine too.
regards
--
Tomas Vondra
On Thu, Sep 12, 2024 at 10:49 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Thanks to Peter for the description, that helped me debug the issue. I
think I found a fix for the issue: regression tests for 811af978
consistently got stuck on my macbook before the attached patch 0001,
after applying that patch they completed just fine.
Thanks for taking a look at it.
The fix in 0001 is relatively simple: we stop backends from waiting
for a concurrent backend to resolve the NEED_PRIMSCAN condition, and
instead move our local state machine so that we'll hit _bt_first
ourselves, so that we may be able to start the next primitive scan.
I agree with your approach, but I'm concerned about it causing
confusion inside _bt_parallel_done. And so I attach a v2 revision of
your bug fix. v2 adds a check that nails that down, too. I'm not 100%
sure if the change to _bt_parallel_done becomes strictly necessary, to
make the basic fix robust, but it's a good idea either way. In fact, it
seemed like a good idea even before this bug came to light: it was
already clear that this was strictly necessary for the skip scan
patch. And for reasons that really have nothing to do with the
requirements for skip scan (it's related to how we call
_bt_parallel_done without much care in code paths from the original
parallel index scan commit).
More details on changes in v2 that didn't appear in Matthias' v1:
v2 makes _bt_parallel_done do nothing at all when the backend-local
so->needPrimScan flag is set (regardless of whether it has been set by
_bt_parallel_seize or by _bt_advance_array_keys). This is a bit like
the approach taken before the Postgres 17 work went in:
_bt_parallel_done used to only permit the shared btps_pageStatus state
to become BTPARALLEL_DONE when it found that "so->arrayKeyCount >=
btscan->btps_arrayKeyCount" (else the call was a no-op). With this
extra hardening, _bt_parallel_done will only permit setting BTPARALLEL_DONE when
"!so->needPrimScan". Same idea, more or less.
v2 also changes comments in _bt_parallel_seize. The comment tweaks
suggest that the new "if (!first && status ==
BTPARALLEL_NEED_PRIMSCAN) return false" path is similar to the
existing "if (!first && so->needPrimScan) return false" precheck
logic on master (the precheck that takes place before
examining any state in shared memory). The new path can be thought of
as dealing with cases where the backend-local so->needPrimScan flag
must have been stale back when it was prechecked -- it's essentially the same
logic, though unlike the precheck it works against the authoritative
shared memory state.
My current plan is to commit something like this in the next day or two.
--
Peter Geoghegan
Attachments:
v2-0001-Fix-stuck-parallel-btree-scans.patch (application/x-patch)
From 11282515bae8090b30663814c5f91db00488508d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 16 Sep 2024 14:28:57 -0400
Subject: [PATCH v2] Fix stuck parallel btree scans
Before, a backend that called _bt_parallel_seize was not always
guaranteed to be able to move forward on a state where more work
was expected from parallel backends, and handled NEED_PRIMSCAN as
a semi-ADVANCING state. This caused issues when the leader process
was waiting for the state to advance and concurrent backends were
waiting for the leader to consume the buffered tuples they still
had after updating the state to NEED_PRIMSCAN.
This is fixed by treating _bt_parallel_seize()'s status output as
the status of a currently active primitive scan. If _seize is
called from outside _bt_first, and the scan state is NEED_PRIMSCAN,
then we'll end our current primitive scan and set the scan up for
a new primitive scan, eventually hitting _bt_first's call to
_seize.
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reported-By: Tomas Vondra <tomas@vondra.me>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzmMGaPa32u9x_FvEbPTUkP5e95i=QxR8054nvCRydP-sw@mail.gmail.com
Backpatch: 17-, where nbtree SAOP execution was enhanced.
---
src/backend/access/nbtree/nbtree.c | 53 +++++++++++++++++++-----------
1 file changed, 33 insertions(+), 20 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 686a3206f..456a04995 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -585,7 +585,10 @@ btparallelrescan(IndexScanDesc scan)
* or _bt_parallel_done().
*
* The return value is true if we successfully seized the scan and false
- * if we did not. The latter case occurs if no pages remain.
+ * if we did not. The latter case occurs when no pages remain, or when
+ * another primitive index scan is scheduled that caller's backend cannot
+ * start just yet (only backends that call from _bt_first are capable of
+ * starting primitive index scans, which they indicate by passing first=true).
*
* If the return value is true, *pageno returns the next or current page
* of the scan (depending on the scan direction). An invalid block number
@@ -596,10 +599,6 @@ btparallelrescan(IndexScanDesc scan)
* scan will return false.
*
* Callers should ignore the value of pageno if the return value is false.
- *
- * Callers that are in a position to start a new primitive index scan must
- * pass first=true (all other callers pass first=false). We just return false
- * for first=false callers that require another primitive index scan.
*/
bool
_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
@@ -616,13 +615,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
{
/*
* Initialize array related state when called from _bt_first, assuming
- * that this will either be the first primitive index scan for the
- * scan, or a previous explicitly scheduled primitive scan.
- *
- * Note: so->needPrimScan is only set when a scheduled primitive index
- * scan is set to be performed in caller's worker process. It should
- * not be set here by us for the first primitive scan, nor should we
- * ever set it for a parallel scan that has no array keys.
+ * that this will be the first primitive index scan for the scan
*/
so->needPrimScan = false;
so->scanBehind = false;
@@ -630,8 +623,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
else
{
/*
- * Don't attempt to seize the scan when backend requires another
- * primitive index scan unless we're in a position to start it now
+ * Don't attempt to seize the scan when it requires another primitive
+ * index scan, since caller's backend cannot start it right now
*/
if (so->needPrimScan)
return false;
@@ -653,12 +646,9 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
{
Assert(so->numArrayKeys);
- /*
- * If we can start another primitive scan right away, do so.
- * Otherwise just wait.
- */
if (first)
{
+ /* Can start another primitive scan right away, so do so */
btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
for (int i = 0; i < so->numArrayKeys; i++)
{
@@ -668,11 +658,25 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
array->cur_elem = btscan->btps_arrElems[i];
skey->sk_argument = array->elem_values[array->cur_elem];
}
- so->needPrimScan = true;
- so->scanBehind = false;
*pageno = InvalidBlockNumber;
exit_loop = true;
}
+ else
+ {
+ /*
+ * Don't attempt to seize the scan when it requires another
+ * primitive index scan, since caller's backend cannot start
+ * it right now
+ */
+ status = false;
+ }
+
+ /*
+ * Either way, update backend local state to indicate that a
+ * pending primitive scan is required
+ */
+ so->needPrimScan = true;
+ so->scanBehind = false;
}
else if (btscan->btps_pageStatus != BTPARALLEL_ADVANCING)
{
@@ -731,6 +735,7 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
void
_bt_parallel_done(IndexScanDesc scan)
{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
bool status_changed = false;
@@ -739,6 +744,13 @@ _bt_parallel_done(IndexScanDesc scan)
if (parallel_scan == NULL)
return;
+ /*
+ * Should not mark parallel scan done when there's still a pending
+ * primitive index scan
+ */
+ if (so->needPrimScan)
+ return;
+
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
@@ -747,6 +759,7 @@ _bt_parallel_done(IndexScanDesc scan)
* already
*/
SpinLockAcquire(&btscan->btps_mutex);
+ Assert(btscan->btps_pageStatus != BTPARALLEL_NEED_PRIMSCAN);
if (btscan->btps_pageStatus != BTPARALLEL_DONE)
{
btscan->btps_pageStatus = BTPARALLEL_DONE;
--
2.45.2
Hi,
I've been looking at this patch over the last couple of days, mostly doing
some stress testing / benchmarking (hence the earlier report) and basic
review. I do have some initial review comments, and the testing produced
some interesting regressions (not sure if those are the cases where
skipscan can't really help, that Peter mentioned he needs to look into).
review
------
First, the review comments - nothing particularly serious, mostly just
cosmetic stuff:
1) v6-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch
- I find the places that increment "nsearches" a bit random. Each AM
does it in entirely different place (at least it seems like that to me).
Is there a way to make this a bit more consistent?
- I find this comment rather unhelpful:
uint64 btps_nsearches; /* instrumentation */
Instrumentation what? What's the counter for?
- I see _bt_first moved the pgstat_count_index_scan, but doesn't that
mean we skip it if the earlier code does "goto readcomplete"? Shouldn't
that still count as an index scan?
- show_indexscan_nsearches does this:
if (scanDesc && scanDesc->nsearches > 0)
ExplainPropertyUInteger("Index Searches", NULL,
scanDesc->nsearches, es);
But shouldn't it divide the count by nloops, similar to (for example)
show_instrumentation_count?
2) v6-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch
- Admittedly very subjective, but I find the "oppoDirCheck" abbreviation
rather weird, I'd just call it "oppositeDirCheck".
3) v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch
- nothing
4) v6-0004-Add-skip-scan-to-nbtree.patch
- indices.sgml seems to have a typo "Intevening" -> "Intervening"
- It doesn't seem like a good idea to remove the paragraph about
multicolumn indexes and replace it with just:
Multicolumn indexes should be used judiciously.
I mean, what does judiciously even mean? What should the user consider
to be judicious? Seems rather unclear to me. Admittedly, the old text
was not very helpful, but at least it gave some advice.
But maybe more importantly, doesn't skipscan apply only to a rather
limited subset of data types (that support increment/decrement)? Doesn't
the new wording mostly ignore that, implying skipscan applies to all
btree indexes? I don't think it mentions datatypes anywhere, but there
are many indexes on data types like text, UUID and so on.
- Very subjective nitpicking, but I find it a bit strange when a comment
about a block is nested in the block, like in _bt_first() for the
array->null_elem check.
- assignProcTypes() claims providing skipscan for cross-type scenarios
doesn't make sense. Why is that? I'm not saying the claim is wrong, but
it's not clear to me why that would be the case.
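On the nloops question under item 1, the per-loop normalization could look roughly like the sketch below. searches_per_loop is a hypothetical helper, not a function in the patch or in explain.c; it only illustrates dividing the raw counter by the number of loops and guarding against nloops = 0.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: average number of index searches per loop.
 * nsearches is the raw counter, nloops the number of times the node ran.
 */
static double
searches_per_loop(uint64_t nsearches, double nloops)
{
    assert(nloops >= 0);

    /* A node that never ran reports zero rather than dividing by zero */
    if (nloops > 0)
        return (double) nsearches / nloops;
    return 0.0;
}
```

With this, a node that ran 3 loops and performed 3021 searches in total would report 1007 searches per loop instead of the raw total.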
costing
-------
Peter asked me to look at the costing, and I think it looks generally
sensible. We don't really have a lot of information to base the costing
on in the first place - the whole point of skipscan is about multicolumn
indexes, but none of the existing extended statistics seem very useful.
We'd need some cross-column correlation info, or something like that.
It's an interesting question - if we could collect some new statistics
for multicolumn indexes (say, by having a way to collect AM-specific
stats), what would we collect for skipscan?
There's one thing that I don't quite understand, and that's how
btcost_correlation() adjusts correlation for multicolumn indexes:
if (index->nkeycolumns > 1)
indexCorrelation = varCorrelation * 0.75;
That seems fine for a two-column index, I guess. But shouldn't it
compound for indexes with more keys? I mean, 0.75 * 0.75 for third
column, etc? I don't think btcostestimate() does that, it just remembers
whatever btcost_correlation() returns.
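To make the question concrete, here is a sketch comparing the single 0.75 discount quoted above with a compounding variant. Both functions are hypothetical illustrations; the compounding one is only what I'm asking about, not something the patch or btcostestimate() does.

```c
#include <assert.h>

/* Single flat discount, as in the snippet quoted above */
static double
toy_flat_correlation(double varCorrelation, int nkeycolumns)
{
    assert(nkeycolumns >= 1);
    return (nkeycolumns > 1) ? varCorrelation * 0.75 : varCorrelation;
}

/* Hypothetical alternative: apply 0.75 once per additional key column */
static double
toy_compounded_correlation(double varCorrelation, int nkeycolumns)
{
    double      result = varCorrelation;

    assert(nkeycolumns >= 1);
    for (int i = 1; i < nkeycolumns; i++)
        result *= 0.75;
    return result;
}
```

For a three-column index and varCorrelation = 1.0, the flat version gives 0.75 while the compounded one gives 0.75 * 0.75 = 0.5625.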
Anyway, the overall costing approach seems sensible, I think. It assumes
things we assume in general (columns/keys are considered independent),
which may be problematic, but this is the best we can do.
The only alternative approach I can think of is not to adjust the
costing for the index scan at all, and only use this to enable (or not
enable) the skipscan internally. That would mean the overall plan
remains the same, and maybe sometimes we would think an index scan would
be too expensive and use something else. Not great, but it doesn't have
the risk of regressions - IIUC we can disable the skipscan at runtime,
if we realize it's not really helpful.
If we're concerned about regressions, I think this would be the way to
deal with them. Or at least it's the best idea I have.
testing
-------
As usual, I wrote a bash script to do a bit of stress testing. It
generates tables with random data, and then runs random queries with
random predicates on them, while mutating a couple parameters (like
number of workers) to trigger different plans. It does that on 16,
master and with the skipscan patch (with the fix for parallel scans).
I've uploaded the script and results from the last run here:
https://github.com/tvondra/pg-skip-scan-tests
There's the "run-mdam.sh" script that generates tables/queries, runs
them, collects all kinds of info about the query, and produces files
with explain plans, CSV with timings, etc.
Not all of the queries end up using index scans - depending on the
predicates, etc. it might have to use seqscan. Or maybe it only uses
index scan because it's forced to by the enable_* options, etc.
Anyway, I ran a couple thousand such queries, and I haven't found any
incorrect results (the script compares that between versions too). So
that's good ;-)
But my main goal was to see how this affects performance. The tables
were pretty small (just 1M rows, maybe ~80MB), but with restarts and
dropping caches, large enough to test this.
And generally the results seem good. You can either inspect the CSV with
raw results (look at the script to understand what the fields are), or
check the attached PDF with a pivot table summarizing them.
As usual, there's a heatmap on the right side, comparing the results for
different versions (first "master/16" and then "skipscan/master"). Green
means "speedup/good" and red means "regression/bad".
Most of the places are "white" (no change) or not very far from it, or
perhaps "green". But there's also a bunch of red results, which means
regression (FWIW the PDF is filtered only to queries that would actually
use the executed plans without the GUCs).
Some of the red places are for very short queries - just a couple ms,
which means it can easily be random noise, or something like that. But a
couple queries are much longer, and might deserve some investigation.
The easiest way is to look at the "QID" column in the row, which
identifies the query in the "query" CSV. Then look into the results CSV
for IDs of the runs (in the first "SEQ" column), and find the details in
the "analyze" log, which has all the plans etc.
Alternatively, use the .ods in the git repository, which allows drill
down to results (from the pivot tables).
For example, one of the slowed down queries is query 702 (top of page 8
in the PDF). The query is pretty simple:
explain (analyze, timing off, buffers off)
select id1,id2 from t_1000000_1000_1_2
where NOT (id1 in (:list)) AND (id2 = :value);
and it was executed on a table with random data in two columns, each
with 1000 distinct values. This is perfectly random data, so a great
match for the assumptions in costing etc.
But with uncached data, this runs in ~50 ms on master, but takes almost
200 ms with skipscan (these timings are from my laptop, but similar to
the results).
-- master
Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on
t_1000000_1000_1_2 (cost=0.96..20003.96 rows=1719 width=16)
(actual rows=811 loops=1)
Index Cond: (id2 = 997)
Filter: (id1 <> ALL ('{983,...,640}'::bigint[]))
Rows Removed by Filter: 163
Heap Fetches: 0
Planning Time: 7.596 ms
Execution Time: 28.851 ms
(7 rows)
-- with skipscan
Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on
t_1000000_1000_1_2 (cost=0.96..983.26 rows=1719 width=16)
(actual rows=811 loops=1)
Index Cond: (id2 = 997)
Index Searches: 1007
Filter: (id1 <> ALL ('{983,...,640}'::bigint[]))
Rows Removed by Filter: 163
Heap Fetches: 0
Planning Time: 3.730 ms
Execution Time: 238.554 ms
(8 rows)
I haven't looked into why this is happening, but this seems like a
pretty good match for skipscan (on the first column). And for the
costing too - it's perfectly random data, no correlation, etc.
regards
--
Tomas Vondra
Attachments:
optimal.pdf (application/pdf)